Abstract




Recent trends in information management involve the periodic transcription of data onto
secondary devices in a networked environment, and the proper scheduling of these transcriptions
is critical for efficient data management. To assist in the scheduling process, we are interested
in modeling data obsolescence, that is, the reduction of consistency over time between a relation
and its replica. The modeling is based on techniques from the field of stochastic processes, and
provides several stochastic models for content evolution in the base relations of a database, taking
referential integrity constraints into account. These models are general enough to accommodate
most of the common scenarios in databases, including batch insertions and life spans both with
and without memory. As an initial "proof of concept" of the applicability of our approach, we
validate the insertion portion of our model framework via experiments with real data feeds. We
also discuss a set of transcription protocols which make use of the proposed stochastic model.

1 Introduction and motivation

Recent developments in information management involve the transcription of data onto secondary
devices in a networked environment, e.g., materialized views in data warehouses and search engines,
and replicas in pervasive systems. Data transcription influences the way databases define and
maintain consistency. In particular, the networked environment may require periodic (rather than
continuous) synchronization between the database and secondary copies, either due to paucity of
resources (e.g., low bandwidth or limited night windows) or to the transient characteristics of the
connection. Hence, the consistency of the information in secondary copies, with respect to the
transcription origin, varies over time and depends on the rate of change of the base data and on
the frequency of synchronization.

Systematic approaches to the proper scheduling of transcriptions necessarily involve optimizing
a trade-off between the cost of transcribing fresh information and the cost of using obsolescent
data. To do so, one must quantify, at least in probabilistic terms, this latter cost, which we
call obsolescence cost [11]. This paper aims to provide a comprehensive stochastic framework
for quantifying time-dependent data obsolescence in replicas. Suppose we are given a relation
$R$, a start time $s \in \mathbb{R}$, and some later time $f > s$. We denote the extension of a
relation $R$ at time $t \in \mathbb{R}$ by $R(t)$. Starting from a known extension $R(s)$, we are
interested in making probabilistic predictions about the contents of the later extension $R(f)$.
We also suggest a cost model schema to quantify the difference between $R(s)$ and $R(f)$. Such
tools assist in optimizing the synchronization process, as demonstrated in this paper. Our approach
is based on techniques from the field of stochastic processes, and provides several stochastic
models for content evolution in a relational database, taking referential integrity constraints
into account. In particular, we make use of compound nonhomogeneous Poisson models and Markov
chains; see for example [28, 29, ?]. We use Poisson processes to model the behavior of tuples
entering and departing relations, allowing (nonhomogeneous) time-varying behavior (e.g., more
intensive activity during work hours, and less intensive activity after hours and on weekends) as
well as compound (bulk) insertions, that is, the simultaneous arrival of several tuples. We use
Markov chains in a general modeling approach for attribute modifications, allowing the assignment
of a new value to an attribute in a tuple to depend on its current value. The approach is general
enough to accommodate most of the common scenarios in databases, including batch insertions and
memoryless, as well as time-dependent, life spans.

As motivation, consider the following two examples: 

Example 1 (Query optimization) Query optimization relies heavily on estimating the cardinality
and value distribution of relations in a database. If these statistics are outdated and inaccurate,
even the best query optimizer may formulate poor execution plans. Typically, statistics are updated
only periodically, usually at the discretion of the database administrator, using utilities such as
DB2's RUNSTATS. Although some research has been devoted to speeding up statistics collection
through sampling and wavelet approximations [?, ?], periodic updates are unavoidable in very
large databases such as IBM's Net.Commerce [?], an e-business software package with roughly one
hundred relations, or an SAP application, which has more than 8,000 relations and 9,000 indices.
Collection of statistics becomes an even more acute problem in database federations [?], where the
federation members do not always "volunteer" their statistics [?] (or their cost models, for that
matter [?]), and are unwilling to burden their resources with frequent statistics collection.

In current practice, cardinality or histogram data recorded at time $s$ are used unchanged until
the next full analysis of the database at some later time $s' > s$. If a query optimization must be
performed at some time $f \in (s, s')$, the optimizer simply uses the statistics gathered at time
$s$, since the time spent recomputing them may overwhelm any benefits of the query optimization.
As an alternative, we suggest using a probabilistic estimate of the necessary statistics at time
$f$. Use of these techniques might make it possible to increase the interval between
statistics-gathering scans, as will be discussed in Example [?]. □

Example 2 (Replication management in distributed databases) We now consider replication
management in a distributed database. Since fully synchronous replication management, in
which a user is guaranteed access to the most current data, comes at a significant computational
cost, most commercial distributed database providers have adopted asynchronous replication
management. That is, updates to relation replicas are performed after the original transaction has
committed, in accordance with the workload of the machine on which the secondary copy is stored.
Asynchronous replicas are also very common in Web applications such as search engines, where
Web crawlers sample Web sites periodically, and in pervasive systems (e.g., Microsoft's Mobile
Information Server^1 and Cafe Central^2). In a pervasive system, a server serves many different
users, each with her own unpredictable connectivity schedule and dynamically changing device
capabilities. Our modeling techniques would allow client devices to reduce the rate at which they
poll the server, saving both server resources and network bandwidth. We demonstrate the usefulness
of stochastic modeling in this setting in Sections 3.1 and 5.6. □

^1 http://www.microsoft.com/servers/miserver/
^2 http://www.comalex.com/central.htm



The novelty of this paper is in developing a formal framework for modeling content evolution in
relational databases. The problem of content evolution with respect to materialized views (which
may be regarded as a complex form of data transcription) in databases has already been recognized.
For example, in [?], the incompleteness of data in views was noted as being a "dynamic notion since
data may be constantly added/removed from the view." Yet, we believe that there has been no prior
formal modeling of the evolution process.^3 Related research involves the containment property of
a materialized view with respect to its base data: a few of the many references in this area include
[38, ?, 19, ?, 13]. However, the temporal aspects of content evolution have not been systematically
addressed in this work. In [?], for example, the containment relationships between a materialized
view $I$ and the "true" query result $V(D)$, taken from a database $D$, can be either $I = V(D)$
or $I \subset V(D)$. The latter relationship represents a situation where the materialized view
stores only a partial subset of the query result. However, taking content evolution into account,
it is also possible that $I \supset V(D)$, if tuples may be deleted from $V(D)$ and $I$ is
periodically updated. Moreover, modifications to the base data may result in both
$I \not\subseteq V(D)$ and $I \not\supseteq V(D)$.

Refresh policies for materialized views have been previously discussed in the literature (e.g.,
[20] and [?]). Typically, materialized views are refreshed immediately upon updates to the base
data, at query time (as in [?]), or using snapshot databases (as in [?]). The latter approach can
produce obsolescent materialized views. A combination of all three approaches appears in [?]. Our
methodology differs in that we do not assume an a priori association of a materialized view with
a refresh policy, but instead design policies based on their transcription and obsolescence costs.

A preliminary attempt to describe the time dependency of updates in the context of Web 
management was given in [?], which suggests a simple homogeneous Poisson process to model the
updating of Web pages. We suggest instead a nonhomogeneous compound Poisson model, which is
far more flexible, and yet still tractable. In addition, the work in [?] supposes that transcriptions
are performed at uniform time intervals, mainly because "crawlers cannot guess the best time to 
visit each site." We show in this paper that our model of content evolution gives rise to other, 
better transcription policies. 



In [25], a trade-off mechanism was suggested to decide between the use of a cache or recomputation
from base data by using range data, computed at the source. In this framework, an update
is "pushed" to a replication site whenever updated data falls outside a predetermined interval, or
whenever a query requires current data. The former requires the client and the server to be in
touch continuously, in case the server needs to track down the client, which is not always realistic
(either because the server does not provide such services, or because the overhead for such services
undermines the cost-effectiveness of the client). The latter requirement puts the burden of deciding
whether to refresh the data on the client, without providing it with any model for the evolution of
the base data. We attempt to fill this gap by providing a stochastic model for content evolution,
which allows a client to make judicious requests for current data. Other work in related areas (e.g.,
[3, 5, ?]) has considered various alternatives for pushing updated data from a server to a cache on
the client side. Lazy replica-update policies using replication graphs have also been discussed in, for
example, [?]. This work, however, does not take data obsolescence into account, and is primarily
concerned with transaction throughput and timely updates, subject to network constraints.

As with models in general, our model is an idealized representation of a process. To be useful,
it should support predictions based on tractable analytical calculations, rather than detailed,
computationally intensive simulations. Therefore, we restrict our modeling to some of the more
basic tools of applied probability theory, specifically those relating to Poisson processes and Markov
chains. Texts such as [28, ?] contain the necessary reference material on Markov chains and Poisson
processes, and specifically on nonhomogeneous Poisson processes. Poisson processes can model a
world where data updates are independent from one another. In databases with widely distributed
access, e.g., Web interfacing databases, such an independence assumption seems plausible, as was
verified in [?].

^3 Other research efforts involve probabilistic database systems (e.g., [18]), but this work is
concerned with uncertainty in the stored data, rather than data evolution.

The rest of the paper is organized as follows: Section 1.1 introduces some basic notation.
Section 2 provides a content evolution model for insertions and deletions, while Section 3 discusses
data modifications. We present preliminary results of fitting the insertion model parameters
to real data feeds in Section 4. A cost model and transcription policies that utilize it follow in
Section 5, highlighting the practical impact of the model. Conclusions and topics for further
research are provided in Section 6.

1.1 Notational preliminaries 

In what follows, we denote the set of attributes and relations in the database by $\mathcal{B}$
and $\mathcal{R}$, respectively. Each $R \in \mathcal{R}$ consists of a set of attributes
$\mathcal{A}(R) \subseteq \mathcal{B}$, and also has a primary key $\mathcal{K}(R)$, which is a
nonempty subset of $\mathcal{A}(R)$. Each attribute $A \in \mathcal{B}$ has a domain
$\mathrm{dom}\,A$, which we assume to be a finite set, and for any subset of attributes
$\mathbf{A} = \{A_1, A_2, \ldots, A_k\}$, we let
$\mathrm{dom}\,\mathbf{A} = \mathrm{dom}\,A_1 \times \mathrm{dom}\,A_2 \times \cdots \times \mathrm{dom}\,A_k$
denote the compound domain of $\mathbf{A}$. We denote by $r.A(t)$ the value of attribute $A$ in
tuple $r$ at time $t$, and similarly use $r.\mathbf{A}(t)$ for the value of a compound
attribute. For a given time $t$, subset of attributes $\mathbf{A} \subseteq \mathcal{A}(R)$, and
value $v = (v_1, v_2, \ldots, v_k) \in \mathrm{dom}\,\mathbf{A}$, we define
$R_{\mathbf{A},v}(t) = \{ r \in R(t) \mid (r.A_1(t) = v_1) \wedge (r.A_2(t) = v_2) \wedge \cdots \wedge (r.A_k(t) = v_k) \}$.
We also define $\hat{R}_{\mathbf{A}}(t)$ to be the histogram of values of $\mathbf{A}$ at time
$t$, that is, for each value $v \in \mathrm{dom}\,\mathbf{A}$, $\hat{R}_{\mathbf{A}}(t)$
associates a nonnegative integer $\hat{R}_{\mathbf{A},v}(t)$, which is the cardinality of
$R_{\mathbf{A},v}(t)$.^4 This notation, as well as other symbols used throughout the paper, is
also summarized in Table 1.
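
To make the selection and histogram notation concrete, here is a minimal Python sketch. It assumes an extension $R(t)$ is held in memory as a list of dictionaries; the helper names `select` and `histogram` and the sample `orders` data are illustrative assumptions, not part of the paper.

```python
from collections import Counter

def select(extension, attributes, value):
    """R_{A,v}(t): tuples of the extension whose attributes A equal the value v."""
    return [r for r in extension if tuple(r[a] for a in attributes) == tuple(value)]

def histogram(extension, attributes):
    """Histogram of A over the extension: value v -> cardinality of R_{A,v}(t)."""
    return Counter(tuple(r[a] for a in attributes) for r in extension)

# Hypothetical extension R(t) of an ORDERS-like relation, one dict per tuple.
orders = [
    {"id": 1, "zip": "08904", "status": "open"},
    {"id": 2, "zip": "08904", "status": "shipped"},
    {"id": 3, "zip": "10001", "status": "open"},
]

by_zip = histogram(orders, ["zip"])
in_08904 = select(orders, ["zip"], ["08904"])
```

Each histogram entry equals the size of the corresponding selection, which is the relationship the definition above states.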

2 Modeling insertions and deletions 

This section introduces the stochastic models for insertions and deletions. Section 2.1 discusses
insertions, while deletions are discussed in Section 2.2. Section 2.3 combines the effect of
insertions and deletions on a relation's cardinality. We conclude with a discussion of
non-exponential life spans in Section 2.4. We defer discussing model validation until Section 4.



2.1 Insertion 

We use a nonhomogeneous Poisson process [25, ?] with instantaneous arrival rate
$\lambda_R : \mathbb{R} \to [0, \infty)$ to model the occurrence of insertion events into $R$.
That is, the number of insertion events occurring in any interval $(s, f]$ is a Poisson random
variable with expected value $\Lambda_R(s, f) = \int_s^f \lambda_R(t)\,dt$. A homogeneous Poisson
process may be considered as the special case where $\lambda_R(t)$ is equal to a constant
$\lambda_R > 0$ for all $t$, yielding
$\Lambda_R(s, f) = \int_s^f \lambda_R(t)\,dt = \int_s^f \lambda_R\,dt = \lambda_R \cdot (f - s)$.
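
As a minimal sketch of this definition, the following Python fragment approximates $\Lambda_R(s, f)$ numerically. The composite trapezoidal rule, the function name `expected_insertion_events`, and both example rate functions are assumptions chosen for illustration; any standard quadrature would serve.

```python
import math

def expected_insertion_events(rate, s, f, steps=10_000):
    """Approximate Lambda_R(s, f), the integral of rate(t) over (s, f],
    using the composite trapezoidal rule (an illustrative choice)."""
    if f < s:
        raise ValueError("need f >= s")
    h = (f - s) / steps
    total = 0.5 * (rate(s) + rate(f))
    for i in range(1, steps):
        total += rate(s + i * h)
    return total * h

# Homogeneous case: a constant rate lambda_R = 2.0 gives lambda_R * (f - s).
homogeneous = expected_insertion_events(lambda t: 2.0, 1.0, 4.0)

# Hypothetical time-varying rate with a 24-hour cycle (t measured in hours).
def daily_rate(t):
    return 3.0 + 2.0 * math.sin(2.0 * math.pi * t / 24.0)

one_day = expected_insertion_events(daily_rate, 0.0, 24.0)
```

Over a full cycle the sinusoidal term integrates to zero, so `one_day` is governed by the mean rate alone.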

We now consider the interarrival time distribution of the nonhomogeneous Poisson process. We 
first define the nonhomogeneous exponential distribution, as follows: 



^4 This vector can be computed exactly and efficiently using indices. Alternatively, in the absence
of an index for a given attribute, statistical methods (such as "probabilistic" counting [?],
sampling-based estimators [?], and wavelets [23]) can be applied.

Table 1: List of Symbols.



Definition 1 (Nonhomogeneous exponential distribution) Let $\phi : \mathbb{R} \to [0, \infty)$ be
an integrable function. Given some $s \in \mathbb{R}$, a random variable $V$ is said to have a
nonhomogeneous exponential distribution (denoted by $V \sim \mathrm{Exp}_s(\phi(\cdot))$) if $V$'s
density function is

$$p(\tau) = \begin{cases} \phi(s + \tau) \exp\left( -\int_0^{\tau} \phi(s + u)\,du \right), & \tau \ge 0, \\ 0, & \tau < 0. \end{cases}$$

It is worth noting that if $\phi(t)$ is constant, $p(\tau)$ is just a standard exponential
density. We shall now show that, as with homogeneous Poisson processes, the interarrival time of
insertion events is distributed like an exponential random variable, $L_{R,s}$, but with a
time-varying density function.

Lemma 1 At any time $s$, the amount of time $L_{R,s}$ to the next insertion event is distributed
like $\mathrm{Exp}_s(\lambda_R(\cdot))$. The probability of an insertion event occurring during
$(s, f]$ is $P\{L_{R,s} \le f - s\} = 1 - e^{-\Lambda_R(s,f)}$.

Proof. Let $\{N(t), t \ge 0\}$ be a nonhomogeneous Poisson process with intensity function
$\lambda_R(t)$, which implies $P\{N(f) - N(s) = 0\} = e^{-\Lambda_R(s,f)}$. Now, the chance that no
new tuple was inserted during $(s, f]$ is the same as the chance that the process $N(\cdot)$ has no
arrivals during $(s, f]$, that is, $e^{-\Lambda_R(s,f)}$. The chance that a new tuple was inserted
during $(s, f]$ is just the complement of the chance of no arrivals, namely,

$$P\{L_{R,s} \le f - s\} = P\{N(f) - N(s) \ge 1\} = 1 - P\{N(f) - N(s) = 0\} = 1 - e^{-\Lambda_R(s,f)}.$$

Taking the derivative of this expression with respect to $f$ and making the change of variables
$\tau = f - s$, the probability density of the time until the next insertion from time $s$ is
$p(\tau) = \lambda_R(s + \tau)\, e^{-\Lambda_R(s, s + \tau)}$.
Thus, $L_{R,s} \sim \mathrm{Exp}_s(\lambda_R(\cdot))$. ■
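
Lemma 1 can be checked numerically. The sketch below integrates the density of Definition 1 over $[0, f - s]$ and compares the result with $1 - e^{-\Lambda_R(s,f)}$. The linear rate function, the helper names, and the use of the trapezoidal rule are assumptions made only for this illustration.

```python
import math

def integrate(fn, a, b, steps=2_000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (fn(a) + fn(b))
    for i in range(1, steps):
        total += fn(a + i * h)
    return total * h

def nonhomogeneous_exp_density(phi, s, tau):
    """Density of Exp_s(phi(.)) at tau, following Definition 1."""
    if tau < 0:
        return 0.0
    inner = integrate(lambda u: phi(s + u), 0.0, tau, steps=200)
    return phi(s + tau) * math.exp(-inner)

def rate(t):
    # Hypothetical arrival rate with linear growth: lambda_R(t) = 0.5 + 0.1 t.
    return 0.5 + 0.1 * t

s, f = 2.0, 7.0
big_lambda = integrate(rate, s, f)            # Lambda_R(s, f)
closed_form = 1.0 - math.exp(-big_lambda)     # right-hand side of Lemma 1
from_density = integrate(
    lambda tau: nonhomogeneous_exp_density(rate, s, tau), 0.0, f - s
)
```

The two quantities `closed_form` and `from_density` agree up to the quadrature error, as the lemma predicts.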

At insertion event $i$, a random number of tuples $\Delta_i^+$ are inserted, allowing us to model
bulk insertions. A bulk insertion is the simultaneous arrival of multiple tuples, and may occur
because the tuples are related, or because of limitations in the implementation of the server. For
example, e-mail servers may process an input stream periodically, resulting in bulk updates of a
mailbox. Assuming that the $\{\Delta_i^+\}$ are independent and identically distributed (IID), the
stochastic process $\{B_R(t), t \ge 0\}$ representing the cumulative number of insertions through
time $t$ is a compound Poisson process (e.g., [?, pp. 87-88]). We let $B_R(s, f)$ denote the number
of insertions falling into the interval $(s, f]$. The expected number of inserted tuples during
$(s, f]$ may be computed via
$E[B_R(s,f)] = \int_s^f \lambda_R(t)\, E[\Delta^+]\,dt = E[\Delta^+] \int_s^f \lambda_R(t)\,dt = E[\Delta^+]\, \Lambda_R(s,f)$.
Here, $\Delta^+$ represents a generic random variable distributed like the $\{\Delta_i^+\}$.
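
The expectation formula for bulk insertions can be illustrated with a small seeded simulation. This is only a sketch for the homogeneous special case; the rate, horizon, batch-size distribution, and function name are hypothetical values chosen for the example.

```python
import random

def simulate_insertions(rate, horizon, batch_sizes, rng):
    """One sample of B_R(0, horizon) for a homogeneous compound Poisson process."""
    t, inserted = rng.expovariate(rate), 0
    while t <= horizon:
        inserted += rng.choice(batch_sizes)   # IID batch size for this event
        t += rng.expovariate(rate)            # exponential interarrival time
    return inserted

rng = random.Random(0)
rate, horizon, batch_sizes = 3.0, 10.0, [1, 2, 3]   # hypothetical parameters
runs = 2_000
average = sum(
    simulate_insertions(rate, horizon, batch_sizes, rng) for _ in range(runs)
) / runs

# E[batch size] * Lambda_R(0, horizon), with Lambda_R = rate * horizon here.
expected = (sum(batch_sizes) / len(batch_sizes)) * rate * horizon
```

With these parameters the sample mean `average` should lie close to `expected`; the gap shrinks as `runs` grows.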

We now consider three simple cases of this model: 

General nonhomogeneous Poisson process: Assume that $E[\Delta^+] = 1$. The expected number
of insertions simplifies to
$E[B_R(s,f)] = E[\Delta^+]\, \Lambda_R(s,f) = 1 \cdot \Lambda_R(s,f) = \Lambda_R(s,f)$.

Homogeneous Poisson process: Assume once more that $E[\Delta^+] = 1$. Assume further that
$\lambda_R(t)$ is a constant function, that is, $\lambda_R(t) = \lambda_R$ for all times $t$. In
this case, as shown above, $\Lambda_R(s,f)$ takes on the simple form $\lambda_R \cdot (f - s)$.
Thus, $E[B_R(s,f)] = \Lambda_R(s,f) = \lambda_R \cdot (f - s)$. The interarrival times are
distributed as $\mathrm{Exp}(\lambda_R)$, the exponential distribution with parameter $\lambda_R$.

Recurrent piecewise-constant Poisson process: A simple kind of nonhomogeneous Poisson
process can be built out of homogeneous Poisson processes that repeat in a cyclic pattern. Given
some length of time $T$, such as one day or one week, suppose that the arrival rate function
$\lambda_R(t)$ of the recurrent Poisson process repeats every $T$ time units, that is,
$\lambda_R(t) = \lambda_R(t - T \lfloor t/T \rfloor)$ for all $t$. Furthermore, the interval
$[0, T)$ is partitioned into a finite number of subsets $J_1, \ldots, J_K$, with $\lambda_R(t)$
constant throughout each $J_k$, $k = 1, \ldots, K$. Finally, each $J_k$ is in turn composed of a
finite number of half-open intervals of the form $[s, f)$. For instance, $T$ might be one day, with
$K = 24$ and $J_1 = [0{:}00, 1{:}00)$, $J_2 = [1{:}00, 2{:}00)$, ..., $J_{24} = [23{:}00, 0{:}00)$.
As another simple example, $T$ might be one week, and $K = 2$. The subset $J_1$ would consist of a
firm's normal hours of operation, say $[9{:}00, 18{:}00)$ for each weekday, and
$J_2 = [0, T) \setminus J_1$ would denote all "off-hour" times. Formalisms like those of [?] could
also be used to describe such processes in a more structured way. We term this class of Poisson
processes recurrent piecewise-constant, abbreviated RPC.
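
An RPC rate function can be sketched in a few lines of Python. The representation of the partition as `(start, end, rate)` triples, the name `make_rpc_rate`, and the one-day schedule below are assumptions made for illustration; the only element taken from the text is the reduction $t - T \lfloor t/T \rfloor$.

```python
import math

def make_rpc_rate(period, segments):
    """Build lambda_R(t) for a recurrent piecewise-constant (RPC) process.

    `segments` is a list of (start, end, rate) triples covering [0, period)
    with half-open intervals [start, end).
    """
    def rate(t):
        phase = t - period * math.floor(t / period)   # t - T * floor(t / T)
        for start, end, value in segments:
            if start <= phase < end:
                return value
        raise ValueError("segments do not cover the period")
    return rate

# Hypothetical one-day cycle in hours: busy 9:00-18:00, quiet otherwise.
rate = make_rpc_rate(24.0, [(0.0, 9.0, 1.0), (9.0, 18.0, 10.0), (18.0, 24.0, 1.0)])
```

Because the phase is computed with `math.floor`, the same function also handles times before the chosen origin.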

It is worth noting that, in client-server environments, the insertion model should typically be
formed from the client's point of view. Therefore, if the server keeps a database from which many
clients transcribe data, the modeling of insertions for a given client should only include the part
of the database the client actually transcribes. For example, if a "road warrior" is interested
only in new orders for the 08904 zip code area, the insertion model for that client should
concentrate on that zip code, ignoring orders arriving from other areas.

2.1.1 The complexity of computing $\Lambda_R(s, f)$

$\Lambda_R(s, f)$, the Poisson expected value, is computed by integrating the model parameter
$\lambda_R(t)$ over the interval $[s, f]$. Standard numerical methods allow rapid approximation of
this definite integral even if no closed formula is known for the indefinite integral. However, the
complexity of this calculation depends on the information-theoretic properties of $\lambda_R(t)$
[?, Section 1].

For our purposes, however, simple models of $\lambda_R(t)$ are likely to suffice. For example, if
$\lambda_R(t)$ is a polynomial of degree $d \ge 0$, the integration can be performed in $O(d + 1)$
time. Consider next a piecewise-polynomial Poisson process: the time line is divided into intervals
such that, in each time interval, $\lambda_R(t)$ can be written as a polynomial. The complexity of
calculating $\Lambda_R(s, f)$ in this case is $O(n(d + 1))$, where $n$ is the number of segments in
the time interval $(s, f]$, and $d$ is the highest degree of the $n$ polynomials.

Further suppose that the piecewise-polynomial process is recurrent in a similar manner to the
RPC process, that is, given some fixed time interval $T$,
$\lambda_R(t) = \lambda_R(t - T \lfloor t/T \rfloor)$ for all $t$. Note that the RPC Poisson process
is the special case of this model in which $d = 0$. If there are $c$ segments in the interval
$[0, T]$, then the complexity of calculating $\Lambda_R(s, f)$ becomes $O(c(d + 1))$, regardless
of the length of the interval $[s, f]$. This reduction occurs because, for all intervals of the form
$[kT, (k+1)T] \subseteq [s, f]$ for which $k$ is an integer, the integral
$\int_{kT}^{(k+1)T} \lambda_R(t)\,dt$ is equal to $\int_0^T \lambda_R(t)\,dt$,
which only needs to be calculated once.

In Section 4, we demonstrate the usefulness of the RPC model for one specific application. We
hypothesize that a recurrent piecewise-polynomial process of modest degree (for example, $d = 3$)
will be sufficient to model most systems we are likely to encounter, and so the complexity of
computing $\Lambda_R(s, f)$ should be very manageable.
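
The reduction described above can be sketched for the RPC case ($d = 0$). The code computes the cumulative integral from time 0 by adding a whole-period contribution to a partial-period remainder, so the cost depends on the number of segments and not on the length of $(s, f]$. The function names and the segment values are hypothetical.

```python
import math

def rpc_cumulative(period, segments, t):
    """Integral of an RPC rate from 0 to t, for any real t.

    `segments` is a list of (start, end, rate) triples partitioning [0, period).
    Whole periods contribute a constant that is computed once per call.
    """
    per_period = sum((end - start) * rate for start, end, rate in segments)
    whole = math.floor(t / period)
    phase = t - whole * period
    partial = sum(
        (min(end, phase) - start) * rate
        for start, end, rate in segments
        if phase > start
    )
    return whole * per_period + partial

def rpc_expected_events(period, segments, s, f):
    """Lambda_R(s, f) in time proportional to the number of segments."""
    return rpc_cumulative(period, segments, f) - rpc_cumulative(period, segments, s)

# Hypothetical one-day cycle in hours: busy 9:00-18:00, quiet otherwise.
segments = [(0.0, 9.0, 1.0), (9.0, 18.0, 10.0), (18.0, 24.0, 1.0)]
short_span = rpc_expected_events(24.0, segments, 10.0, 12.0)
long_span = rpc_expected_events(24.0, segments, 10.0, 24.0 * 1000 + 12.0)
```

Here `long_span` covers a thousand extra days yet requires the same number of arithmetic steps as `short_span`.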

2.2 Deletion 

We allow for two distinct deletion mechanisms. First, we assume individual tuples have their
own intrinsic stochastic life spans. Second, we assume that tuples are deleted to satisfy referential
integrity constraints when tuples in other relations are deleted. These two mechanisms are combined
in a tuple's overall probability of being deleted. Let $R$ and $S$ be two relations such that
$\mathcal{K}(S)$ is a foreign key of $S$ in $R$. We refer to $S$ as a primary relation of $R$.
Consider the directed multigraph $G$ whose vertices consist of all relations $R$ in the database,
and whose edges are of the form $(R, S)$, where $S$ is a primary relation of $R$. The number of
edges $(R, S)$ is the number of foreign keys of $S$ in $R$ for which integrity constraints are
enforced. We assume that $G$ has no directed cycles. Let $G(R)$ denote the subgraph of $G$
consisting of $R$ and all directed paths starting at $R$. We denote the vertices of this subgraph
by $\mathcal{S}(R)$.

Example 3 (Referential integrity constraints in Net.Commerce) IBM's Net.Commerce is
supported by a DB2 database with about a hundred relations interrelated through foreign keys. For
demonstration purposes, consider a sample of seven relations in the Net.Commerce database. Figure 1
is a pictorial representation of the multigraph $G$ of these seven relations. The MERCHANT relation
provides data about merchant profiles, the SCALE and DISCCALC relations are for computing price
discounts, the CATEGORY and CGRYREL relations assist in categorizing products, and the ORDERS
and SHIPTO relations contain information about orders. The six relations SCALE, DISCCALC,
CATEGORY, CGRYREL, ORDERS, and SHIPTO have a foreign key to the MERCHANT relation, through
MERCHANT's primary key (MERFNBR). Integrity constraints are enforced between the SCALE relation
and the MERCHANT relation, as long as SCALE.SCLMENBR (the foreign key to MERCHANT.MERFNBR)
does not have the value NULL. That is, unless a NULL value is assigned to the SCALE.SCLMENBR
attribute, a deletion of a tuple in MERCHANT results in a deletion of all tuples in SCALE such that
SCALE.SCLMENBR = MERCHANT.MERFNBR. DISCCALC has a foreign key to the SCALE relation, through
SCALE's primary key (SCLRFNBR). There are two attributes of CGRYREL that serve as foreign keys
to the CATEGORY relation, through CATEGORY's primary key (CGRFNBR). Finally, SHIPTO contains
shipment information of each product in an order, and therefore it has a foreign key to ORDERS
through its primary key (ORFNBR). □

Figure 1: A partial multigraph of the case study.

With regard to intrinsic deletions within a relation, we assume that each tuple $r \in R(s)$ has
a stochastic remaining life span $L_{r,s}$. This random variable is identically distributed for
each $r \in R(s)$, and is independent of the remaining life span of any other tuple and of $r$'s
age at time $s$ (see Section 2.4 for a discussion of tuples with a non-memoryless life span).
Specifically, we will assume that the chance of $r \in R(t)$ being deleted in the time interval
$[t, t + \Delta t]$ approaches $\mu_R(t)\,\Delta t$ as $\Delta t \to 0$, for some function
$\mu_R : \mathbb{R} \to [0, \infty)$. We define $M_R(s, f) = \int_s^f \mu_R(t)\,dt$.

Lemma 2 $L_{r,s} \sim \mathrm{Exp}_s(\mu_R(\cdot))$. The probability that a tuple $r \in R(s)$ is
deleted by time $f$, given that no corresponding tuple in $\mathcal{S}(R) \setminus \{R\}$ is
deleted, is $P\{L_{r,s} \le f - s\} = 1 - e^{-M_R(s,f)}$.

Proof. Let $r \in R(s)$ be a randomly chosen tuple, and assume that no corresponding tuple to $r$
in $\mathcal{S}(R) \setminus \{R\}$ is deleted. The proof is identical to that of Lemma 1,
replacing $\lambda_R(t)$ with $\mu_R(t)$ and $\Lambda_R(s, f)$ by $M_R(s, f)$. ■



2.2.1 Deletion and referential integrity 

For any $r \in R(s)$ and any relation $S \in \mathcal{S}(R)$, we define $w(r, S)$ to be the number
of tuples in $S$ whose deletion would force deletion of $r$ in order to maintain referential
integrity. This value can be between 0 and the number of paths from $R$ to $S$ in $G(R)$. For
example, if $r \in R = \text{CGRYREL}$ of Figure 1, then $0 \le w(r, \text{CATEGORY}) \le 2$ and
$0 \le w(r, \text{MERCHANT}) \le 3$. For completeness, we define $w(r, R) = 1$. Each tuple in $S$
has an independent remaining lifetime distributed as $\mathrm{Exp}_s(\mu_S(\cdot))$, and if any of
the $w(r, S)$ tuples corresponding to $r$ is deleted, then $r$ must be immediately deleted, to
maintain referential integrity constraints. We use $p_R(s, f)$ to denote the probability that a
randomly chosen tuple in $R(s)$ survives until time $f$.
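
The upper bounds on $w(r, S)$ quoted above are path counts in the multigraph, which can be sketched as follows. The edge list is transcribed from the foreign keys described in Example 3; the function names and the list-of-pairs representation are assumptions for illustration.

```python
from functools import lru_cache

# Foreign-key edges (R, S) of the seven-relation sample described in Example 3.
# A pair listed twice stands for two foreign keys from R to S.
EDGES = [
    ("SCALE", "MERCHANT"), ("DISCCALC", "MERCHANT"), ("CATEGORY", "MERCHANT"),
    ("CGRYREL", "MERCHANT"), ("ORDERS", "MERCHANT"), ("SHIPTO", "MERCHANT"),
    ("DISCCALC", "SCALE"),
    ("CGRYREL", "CATEGORY"), ("CGRYREL", "CATEGORY"),
    ("SHIPTO", "ORDERS"),
]

def primary_relations(relation):
    """Primary relations of `relation`, repeated once per foreign key."""
    return [s for r, s in EDGES if r == relation]

@lru_cache(maxsize=None)
def count_paths(source, target):
    """Number of directed paths from source to target in the acyclic multigraph."""
    if source == target:
        return 1
    return sum(count_paths(nxt, target) for nxt in primary_relations(source))

def reachable(relation):
    """Vertex set of G(R): R together with every relation reachable from R."""
    seen, stack = {relation}, [relation]
    while stack:
        for nxt in primary_relations(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

For CGRYREL this reproduces the bounds 2 (to CATEGORY) and 3 (to MERCHANT): one direct path plus two through CATEGORY.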



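Lemma 3 $p_R(s, f) = E_{r \in R(s)}\left[ \exp\left( -\sum_{S \in \mathcal{S}(R)} w(r, S)\, M_S(s, f) \right) \right]$,
where $E_{r \in R(s)}[\cdot]$ denotes expectation over random selection of tuples in $R(s)$.

Proof. Considering all $S \in \mathcal{S}(R)$, and using the well-known fact that if
$L_i \sim \mathrm{Exp}_s(\mu_i(\cdot))$ for $i = 1, \ldots, k$ are independent, then

$$\min\{L_1, \ldots, L_k\} \sim \mathrm{Exp}_s\left( \sum_{i=1}^{k} \mu_i(\cdot) \right), \tag{1}$$

we conclude that the remaining lifetime of $r$ (denoted $L_{r,s}$) has a nonhomogeneous exponential
distribution with intensity function $\sum_{S \in \mathcal{S}(R)} w(r, S)\, \mu_S(\cdot)$. The
probability of a given tuple $r \in R(s)$ surviving through time $f$ is thus

$$\exp\left( -\int_s^f \Big( \sum_{S \in \mathcal{S}(R)} w(r, S)\, \mu_S(t) \Big)\,dt \right) = \exp\left( -\sum_{S \in \mathcal{S}(R)} w(r, S)\, M_S(s, f) \right),$$

and the probability that a randomly chosen tuple in $R(s)$ survives until time $f$ is therefore

$$p_R(s, f) = E_{r \in R(s)}\left[ \exp\left( -\sum_{S \in \mathcal{S}(R)} w(r, S)\, M_S(s, f) \right) \right]. \tag{2}$$

■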



The complexity analysis of integrating $\mu_S(t)$ over time to obtain $M_S(s, f)$ is similar to
that of Section 2.1.1. However, the computation required by Lemma 3 may be prohibitive, in the most
general case, because it requires knowing the empirical distribution of the $w(r, S)$ over all
$r \in R(s)$ for all $S \in \mathcal{S}(R)$. This empirical distribution can be computed accurately
by computing for each tuple, upon insertion, the number of tuples in any $S \in \mathcal{S}(R)$
with a comparable foreign key, either by using histograms or by directly querying the database.
Maintaining this information requires $O(|R(s)|\,|\mathcal{S}(R)|)$ space. This complexity can be
reduced using a manageably-sized sample from $R(s)$. Our initial analysis of real-world
applications, however, indicates that in many cases, $w(r, S)$ takes on a much simpler form, in
which $w(r, S)$ is identical for all $r \in R(s)$. We term such a typical relationship between $R$
and $S \in \mathcal{S}(R)$ a fixed multiplicity, as defined next:

Definition 2 The pair $(R, S)$, where $S \in \mathcal{S}(R)$, has fixed multiplicity if $w(r, S)$
is identical for all tuples in $R$. In this case, we denote its common value by $w(R, S)$. □

Example 4 (Fixed multiplicities in Net.Commerce) Consider the example multigraph of Figure 1.
Both DISCCALC and SCALE reference MERCHANT. It is clear that the discount calculation of a
product (as stored in DISCCALC) cannot reference a different merchant than SCALE. The only
exception is when the foreign key in SCALE is assigned a null value. If this is the case, however,
there is only a single tuple in MERCHANT whose deletion requires the deletion of a tuple in
DISCCALC. Thus, for any tuple $r \in \text{DISCCALC}$,
$w(r, \text{SCALE}) = w(r, \text{MERCHANT}) = 1$ and therefore (DISCCALC, SCALE) and
(DISCCALC, MERCHANT) both have fixed multiplicity of 1. Now consider CGRYREL. Since each tuple in
CGRYREL describes the relationship between a category and a subcategory, it is clear that its two
foreign keys to CATEGORY must always have distinct values. Thus, (CGRYREL, CATEGORY) has a fixed
multiplicity, and $w(\text{CGRYREL}, \text{CATEGORY}) = 2$. □

As the following lemma shows, fixed multiplicities permit great simplification in computing
$p_R(s, f)$.



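Lemma 4 If $(R, S)$ has fixed multiplicity for all $S \in \mathcal{S}(R)$, then
$p_R(s, f) = \exp(-\tilde{M}_R(s, f))$, where $\tilde{M}_R(s, f) = \int_s^f \tilde{\mu}_R(t)\,dt$
and $\tilde{\mu}_R(t) = \sum_{S \in \mathcal{S}(R)} w(R, S)\, \mu_S(t)$.

Proof. By Lemma 3, and because $w(r, S) = w(R, S)$ for every $r \in R(s)$, the quantity inside the
expectation does not depend on $r$:

$$\begin{aligned}
p_R(s, f) &= E_{r \in R(s)}\left[ \exp\left( -\sum_{S \in \mathcal{S}(R)} w(r, S)\, M_S(s, f) \right) \right] \\
&= \exp\left( -\sum_{S \in \mathcal{S}(R)} w(R, S)\, M_S(s, f) \right) \\
&= \exp\left( -\sum_{S \in \mathcal{S}(R)} w(R, S) \int_s^f \mu_S(t)\,dt \right) \\
&= \exp\left( -\int_s^f \Big( \sum_{S \in \mathcal{S}(R)} w(R, S)\, \mu_S(t) \Big)\,dt \right) \\
&= \exp\left( -\int_s^f \tilde{\mu}_R(t)\,dt \right) = \exp(-\tilde{M}_R(s, f)).
\end{aligned}$$

■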
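
The difference between the general formula of Lemma 3 and the fixed-multiplicity shortcut of Lemma 4 can be sketched numerically. The cumulative rates `M` and the multiplicity for MERCHANT below are hypothetical numbers chosen for the example, and the function names are illustrative.

```python
import math

def survival_probability(multiplicities, cumulative_rates):
    """Lemma 3: average of exp(-sum_S w(r, S) * M_S(s, f)) over tuples r in R(s).

    `multiplicities` holds one dict {relation_name: w(r, S)} per tuple;
    `cumulative_rates` maps relation_name to M_S(s, f).
    """
    total = 0.0
    for w in multiplicities:
        exponent = sum(w[name] * cumulative_rates[name] for name in w)
        total += math.exp(-exponent)
    return total / len(multiplicities)

def survival_probability_fixed(fixed_multiplicity, cumulative_rates):
    """Lemma 4: closed form when w(r, S) = w(R, S) for every tuple."""
    return math.exp(-sum(fixed_multiplicity[name] * cumulative_rates[name]
                         for name in fixed_multiplicity))

# Hypothetical values of M_S(s, f) for R = CGRYREL and the relations it references.
M = {"CGRYREL": 0.10, "CATEGORY": 0.05, "MERCHANT": 0.01}
fixed = {"CGRYREL": 1, "CATEGORY": 2, "MERCHANT": 3}   # MERCHANT value is assumed

p_fixed = survival_probability_fixed(fixed, M)
p_general = survival_probability([fixed] * 4, M)
p_mixed = survival_probability(
    [fixed, {"CGRYREL": 1, "CATEGORY": 2, "MERCHANT": 1}], M
)
```

When every tuple shares the same multiplicities the two functions agree, and only the per-relation values $w(R, S)$ need to be stored rather than one vector per tuple.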



Since $w(R, S)$ is fixed and constant over time, no additional statistics need to be collected for
it. Finally, in certain situations, another alternative may also be available. Let
$\{N_R(t), t \ge 0\}$ be a nonhomogeneous Poisson process with intensity function $\mu_R(t)$,
modeling the occurrence of deletion events in $R$. At deletion event $i$, a random number
$\Delta_i^-$ of tuples are deleted from $R$. Generally speaking, this kind of model cannot be
accurate, since it ignores that each deletion causes a reduction in the number of remaining tuples,
and thus presumably a change in the spacing of subsequent deletion events. However, it may be
reasonably accurate for large databases with either a stable or steadily growing number of tuples,
or whenever the time interval $(s, f]$ is sufficiently small. Statistical analysis of the database
log would be required to say whether the model is applicable. If the model is valid, then the
stochastic process $\{D_R(t), t \ge 0\}$ representing the cumulative number of deletions through
time $t$ can be taken to be a compound Poisson process. The expected number of deleted tuples
during $(s, f]$ may be computed via

$$E[D_R(f) - D_R(s)] = \int_s^f \mu_R(t)\, E[\Delta^-]\,dt = M_R(s, f)\, E[\Delta^-],$$

where $\Delta^-$ is a generic random variable distributed like the $\{\Delta_i^-\}$.

2.3 Tuple survival: the combined effect of insertions and deletions

Some tuples inserted during $(s, f]$ may be deleted by time $f$. Let the random variable
$X_R(s, f)$ denote the number of tuples inserted during the interval $(s, f]$ that survive through
time $f$. Consider any tuple inserted into $R$ at time $t \in (s, f]$, and denote its chance of
surviving through time $f$ by $\hat{p}_R(t, f)$. For any $S \in \mathcal{S}(R)$ and
$t \in (s, f]$, let $W(R, S, t)$ be a random variable denoting the value of $w(r, S)$, given that
$r$ was inserted into $R$ at time $t$. Let $W(R, t)$ denote the random vector, of length
$|\mathcal{S}(R)|$, formed by concatenating the $W(R, S, t)$ for all $S \in \mathcal{S}(R)$.



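Lemma 5 $\hat{p}_R(t, f) = E_{W(R,t)}\left[ \exp\left( -\sum_{S \in \mathcal{S}(R)} W(R, S, t)\, M_S(t, f) \right) \right]$.
When $(R, S)$ has fixed multiplicity for all $S \in \mathcal{S}(R)$, then
$\hat{p}_R(t, f) = p_R(t, f) = \exp(-\tilde{M}_R(t, f))$.

Proof. Let $L_{R,t}$ denote the lifetime of a tuple inserted into $R$ at time $t$. Similarly to the
proof of Lemma 3, we know that
$L_{R,t} \sim \mathrm{Exp}_t\left( \sum_{S \in \mathcal{S}(R)} W(R, S, t)\, \mu_S(\cdot) \right)$.
The probability such a tuple survives through time $f$ is the random quantity

$$\exp\left( -\int_t^f \Big( \sum_{S \in \mathcal{S}(R)} W(R, S, t)\, \mu_S(\tau) \Big)\,d\tau \right) = \exp\left( -\sum_{S \in \mathcal{S}(R)} W(R, S, t)\, M_S(t, f) \right).$$

Considering all the possible elements of the vector $W(R, t)$, we then obtain

$$\hat{p}_R(t, f) = E_{W(R,t)}\left[ \exp\left( -\sum_{S \in \mathcal{S}(R)} W(R, S, t)\, M_S(t, f) \right) \right].$$

Assume now that $(R, S)$ has fixed multiplicity for all $S \in \mathcal{S}(R)$. Consequently, we
replace $W(R, S, t)$ with $w(R, S)$. Drawing on the proof of the previous lemma,

$$\begin{aligned}
\hat{p}_R(t, f) &= E_{W(R,t)}\left[ \exp\left( -\sum_{S \in \mathcal{S}(R)} w(R, S)\, M_S(t, f) \right) \right] \\
&= \exp\left( -\int_t^f \Big( \sum_{S \in \mathcal{S}(R)} w(R, S)\, \mu_S(\tau) \Big)\,d\tau \right) \\
&= \exp(-\tilde{M}_R(t, f)) = p_R(t, f).
\end{aligned}$$

■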



The following proposition establishes the formula for the expected value of $X_R(s,f)$.

Proposition 1 $\mathrm{E}[X_R(s,f)] = \tilde{\Lambda}_R(s,f)\, \mathrm{E}[\Delta^+]$, where $\tilde{\Lambda}_R(s,f) = \int_s^f \lambda_R(t)\, \hat{p}_R(t,f)\, dt$. In the simple
case where each insertion involves exactly one tuple, $X_R(s,f) \sim \mathrm{Poisson}(\tilde{\Lambda}_R(s,f))$.

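Proof. Let $N$ be the number of insertion events in $(s,f]$, and let their times be $\{T_1, T_2, \ldots, T_N\}$.
Suppose that $N = n$ and that insertion event $i$ happens at time $t_i \in (s,f]$. Event $i$ inserts
a random number of tuples $\Delta_i^+$, each of which has probability $\hat{p}_R(t_i,f)$ of surviving through
time $f$. Therefore, the expected number of tuples surviving through $f$ from insertion event $i$ is
$\mathrm{E}[\Delta^+]\, \hat{p}_R(t_i,f)$. Consequently,

$$\mathrm{E}[X_R(s,f) \mid N = n, T_1 = t_1, T_2 = t_2, \ldots, T_n = t_n] = \mathrm{E}[\Delta^+] \sum_{i=1}^{n} \hat{p}_R(t_i,f).$$

Next, we recall that, given $N = n$, the times $T_i$ of the insertion events are distributed like
$n$ independent random variables with probability density function $\lambda_R(t)/\Lambda_R(s,f)$ on the interval
$(s,f]$. Thus,

$$\mathrm{E}[X_R(s,f) \mid N = n] = \mathrm{E}_{t_1,\ldots,t_n}\!\left[\mathrm{E}[\Delta^+] \sum_{i=1}^{n} \hat{p}_R(t_i,f)\right] = n\, \mathrm{E}[\Delta^+] \int_s^f \frac{\lambda_R(t)}{\Lambda_R(s,f)}\, \hat{p}_R(t,f)\, dt = n \left(\frac{\mathrm{E}[\Delta^+]\, \tilde{\Lambda}_R(s,f)}{\Lambda_R(s,f)}\right).$$

Finally, removing the conditioning on $N = n$, we obtain

$$\mathrm{E}[X_R(s,f)] = \mathrm{E}_N\!\left[N \left(\frac{\mathrm{E}[\Delta^+]\, \tilde{\Lambda}_R(s,f)}{\Lambda_R(s,f)}\right)\right] = \mathrm{E}[N] \left(\frac{\mathrm{E}[\Delta^+]\, \tilde{\Lambda}_R(s,f)}{\Lambda_R(s,f)}\right) = \Lambda_R(s,f) \left(\frac{\mathrm{E}[\Delta^+]\, \tilde{\Lambda}_R(s,f)}{\Lambda_R(s,f)}\right) = \mathrm{E}[\Delta^+]\, \tilde{\Lambda}_R(s,f).$$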

In the case that $\Delta_i^+$ is always 1, we may use the notion of a filtered Poisson process: if we
consider only tuples that manage to survive until time $f$, the chance of a single insertion in time
interval $[t, t + \Delta t]$ no longer has the limiting value $\lambda_R(t)\Delta t$, but instead $\lambda_R(t)\,\hat{p}_R(t,f)\,\Delta t$. Therefore,
the insertion of surviving tuples can be viewed as a nonhomogeneous Poisson process with intensity
function $\lambda_R(t)\,\hat{p}_R(t,f)$ over the time interval $(s,f]$, so $X_R(s,f) \sim \mathrm{Poisson}(\tilde{\Lambda}_R(s,f))$. ■
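
The integral in Proposition 1 can be approximated numerically. The following is only a minimal sketch, assuming the fixed-multiplicity case in which the survival probability of a tuple inserted at time $t$ is $\exp(-M_R(t,f))$; the function name `surviving_insert_mass` and its parameters are hypothetical and are not part of the model itself.

```python
import math

def surviving_insert_mass(lam, mu, s, f, steps=10_000):
    """Trapezoid-rule approximation of the integral over (s, f] of
    lam(t) * exp(-M(t, f)), where M(t, f) is the integral of mu over (t, f].

    lam and mu are caller-supplied rate functions of time (assumed inputs).
    """
    h = (f - s) / steps
    ts = [s + i * h for i in range(steps + 1)]
    # Cumulative deletion intensity M(t_i, f), built backwards from t = f,
    # where it equals zero.
    M = [0.0] * (steps + 1)
    for i in range(steps - 1, -1, -1):
        M[i] = M[i + 1] + 0.5 * h * (mu(ts[i]) + mu(ts[i + 1]))
    # Integrand: insertion intensity times survival probability.
    g = [lam(t) * math.exp(-m) for t, m in zip(ts, M)]
    return h * (0.5 * g[0] + sum(g[1:-1]) + 0.5 * g[-1])
```

For constant rates the result can be checked against the elementary integral of $\lambda e^{-\mu(f-t)}$ over $(s,f]$, which is how the sketch is tested here.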

In the general case, the computation of $\tilde{\Lambda}_R(s,f)$ will require approximation by numerical
integration techniques; the complexity of this calculation will depend on the information-theoretic
properties of $\lambda_R(\cdot)$ and the $\mu_S(\cdot)$, $S \in S(R)$, but is unlikely to be burdensome if these functions are
reasonably smoothly varying. In one important special case, however, the complexity of computing
$\tilde{\Lambda}_R(s,f)$ is essentially the same as that of calculating $\Lambda_R(s,f)$: suppose that for some constants
$\alpha(R,S)$, $S \in S(R)$, one has that $\mu_S(t) = \alpha(R,S)\,\lambda_R(t)$ for all $t$. That is, the general insertion and
deletion activity level of the relations in $S(R)$ all vary proportionally to some common fluctuation
pattern. In this case, we have $\mu_R(t) = \alpha(R)\,\lambda_R(t)$ and $M_R(s,f) = \alpha(R)\,\Lambda_R(s,f)$ for all $t, s, f$, where
$\alpha(R) = \sum_{S \in S(R)} w(R,S)\,\alpha(R,S)$. Making a substitution $u(t) = \Lambda_R(t,f)$, we have:

$$\begin{aligned}
\tilde{\Lambda}_R(s,f) &= \int_s^f \lambda_R(t)\, \exp(-\alpha(R)\,\Lambda_R(t,f))\, dt \\
&= \int_s^f \left(-\frac{d\Lambda_R(t,f)}{dt}\right) \exp(-\alpha(R)\,\Lambda_R(t,f))\, dt \\
&= \int_{u(s)}^{u(f)} -\exp(-\alpha(R)\,u)\, du \\
&= \left[\frac{1}{\alpha(R)}\, e^{-\alpha(R)\,u}\right]_{u(s)}^{u(f)} \\
&= \frac{1 - \exp(-\alpha(R)\,\Lambda_R(s,f))}{\alpha(R)},
\end{aligned}$$

so $\tilde{\Lambda}_R(s,f)$ can be calculated directly from $\Lambda_R(s,f)$.

We define the random variable $Y_R(s,f)$ to be the number of tuples in $R(s)$ that survive through
time $f$.

Proposition 2 $\mathrm{E}[Y_R(s,f)] = p_R(s,f)\,|R(s)|$ and $\mathrm{E}[|R(f)|] = p_R(s,f)\,|R(s)| + \tilde{\Lambda}_R(s,f)\,\mathrm{E}[\Delta^+]$.

Proof. Each tuple in $R(s)$ has a survival probability of $p_R(s,f)$, which yields that $\mathrm{E}[Y_R(s,f)] =
p_R(s,f)\,|R(s)|$. By the definitions of $Y_R(s,f)$ and $X_R(s,f)$, one has that

$$|R(f)| = Y_R(s,f) + X_R(s,f),$$

so therefore

$$\mathrm{E}[|R(f)|] = \mathrm{E}[Y_R(s,f)] + \mathrm{E}[X_R(s,f)] = p_R(s,f)\,|R(s)| + \tilde{\Lambda}_R(s,f)\,\mathrm{E}[\Delta^+]. \;\blacksquare$$



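In cases where deletions may also be accurately modeled as a compound Poisson process, we
have

$$\begin{aligned}
\mathrm{E}[|R(f)|] &= \mathrm{E}\big[\,|R(s)| + B_R(s,f) - D_R(s,f)\,\big] \\
&= |R(s)| + \Lambda_R(s,f)\,\mathrm{E}[\Delta^+] - M_R(s,f)\,\mathrm{E}[\Delta^-].
\end{aligned}$$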

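Example 5 (The homogeneous case) Assume that $\mathrm{E}[\Delta^+] = 1$, that $(R,S)$ has fixed multiplicity
for all $S \in S(R)$, and furthermore $\lambda_R(t)$ and $\mu_S(t)$, for all $S \in S(R)$, are constant functions,
that is, $\lambda_R(t) = \lambda_R$ for all times $t$ and $\mu_S(t) = \mu_S$ for all $S \in S(R)$ and times $t$. Then
$\Lambda_R(s,f) = \lambda_R \cdot (f - s)$ and $M_R(s,f) = \mu_R \cdot (f - s)$. Thus, letting $\mu_R = \sum_{S \in S(R)} w(R,S)\,\mu_S$,

$$\tilde{\Lambda}_R(s,f) = \int_s^f \lambda_R\, e^{-\mu_R (f - t)}\, dt = \frac{\lambda_R}{\mu_R}\left(1 - e^{-\mu_R (f - s)}\right),$$

and assuming $\Delta_i^+ = 1$ for all $i > 0$,

$$\mathrm{E}[|R(f)|] = |R(s)|\, e^{-\mu_R (f - s)} + \frac{\lambda_R}{\mu_R}\left(1 - e^{-\mu_R (f - s)}\right) = \frac{\lambda_R}{\mu_R} + e^{-\mu_R (f - s)}\left(|R(s)| - \frac{\lambda_R}{\mu_R}\right). \;\square$$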
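
The homogeneous case is simple enough to evaluate directly. The sketch below is illustrative only: it assumes constant rates, one tuple per insertion, and a strictly positive deletion rate, and the name `expected_cardinality` is a hypothetical helper rather than anything defined in the model.

```python
import math

def expected_cardinality(r0, lam, mu, s, f):
    """Expected |R(f)| in the homogeneous case.

    r0  : cardinality |R(s)| at the last synchronization time s
    lam : constant insertion rate (one tuple per insertion event)
    mu  : constant per-tuple deletion rate, assumed > 0
    """
    decay = math.exp(-mu * (f - s))
    # Survivors of the original tuples plus surviving new insertions.
    return r0 * decay + (lam / mu) * (1.0 - decay)
```

As the interval grows, the value approaches the ratio of the insertion rate to the deletion rate regardless of the starting cardinality, which matches the second form of the formula in the example.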



2.4 Tuples with non-exponential life spans 

We now consider the possibility that tuples in $R$ have a stochastic life span $L_R$ that is not memoryless,
but rather has some general cumulative distribution function $G_R$. For example, if tuples in $R$
correspond to pieces of work in process on a production floor, the likelihood of deletion might rise
the longer the tuple has been in existence. Let us consider a single relation, and thus no referential
integrity constraints. For any tuple $r$, let $b(r)$ denote the time it was created. We next establish
the expected cardinality of $R$ at time $f$.

Proposition 3 In the case that tuples in $R$ have lifetimes with a general cumulative distribution
function $G_R$,

$$\mathrm{E}[|R(f)|] = |R(s)|\; \mathrm{E}_{r \in R(s)}\!\left[\frac{1 - G_R(f - b(r))}{1 - G_R(s - b(r))}\right] + \mathrm{E}[\Delta^+] \int_s^f \lambda_R(t)\,\big(1 - G_R(f - t)\big)\, dt. \qquad (3)$$



Proof. Let $L_R$ denote a generic random variable with cumulative distribution $G_R$. The probability
of $r \in R(s)$ surviving throughout $(s,f]$ is then

$$P\{L_R > f - b(r) \mid L_R > s - b(r)\} = \frac{1 - G_R(f - b(r))}{1 - G_R(s - b(r))},$$

and therefore the expected number of tuples in $R(s)$ that survive through time $f$ is

$$\mathrm{E}[Y_R(s,f)] = \sum_{r \in R(s)} \frac{1 - G_R(f - b(r))}{1 - G_R(s - b(r))} = |R(s)|\; \mathrm{E}_{r \in R(s)}\!\left[\frac{1 - G_R(f - b(r))}{1 - G_R(s - b(r))}\right].$$

We now consider a tuple $r$ inserted at some time $t \in (s,f]$. The probability that such a tuple
survives through time $f$ is simply $\hat{p}_R(t,f) = 1 - G_R(f - t)$. By reasoning similar to the proof of
Lemma g,

$$\mathrm{E}[X_R(s,f)] = \mathrm{E}[\Delta^+]\, \tilde{\Lambda}_R(s,f) = \mathrm{E}[\Delta^+] \int_s^f \lambda_R(t)\,\big(1 - G_R(f - t)\big)\, dt.$$

The conclusion then follows from $\mathrm{E}[|R(f)|] = \mathrm{E}[Y_R(s,f)] + \mathrm{E}[X_R(s,f)]$. ■

It is worth noting that, as opposed to the memoryless case presented above, the calculation
of $\mathrm{E}[Y_R(s,f)]$ requires remembering the commit times $b(r)$ of all tuples $r \in R(s)$, or equivalently
the ages of all such tuples. Of course, for large relations $R$, a reasonable approximation could be
obtained by using a manageably sized sample to estimate

$$\mathrm{E}_{r \in R(s)}\!\left[\frac{1 - G_R(f - b(r))}{1 - G_R(s - b(r))}\right].$$

It is likely that the integral in (3) will require general numerical integration, depending on the exact
form of $G_R$.
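
A rough sketch of how (3) might be evaluated is given below. It is not a definitive implementation: the function name, the trapezoid rule, and the representation of the life span distribution as a Python callable `G` are all assumptions made for illustration, and the list `birth_times` may be either the full set of creation times or a sample scaled up by the caller.

```python
import math

def expected_cardinality_general(birth_times, G, lam, s, f,
                                 mean_batch=1.0, steps=2000):
    """Estimate E[|R(f)|] for a general life span distribution G.

    birth_times : creation times b(r) of the tuples present at time s
    G           : cumulative distribution function of the tuple life span
                  (assumed to satisfy G(s - b) < 1 for every listed tuple)
    lam         : insertion intensity as a function of time
    mean_batch  : expected number of tuples per insertion event
    """
    # Tuples present at time s: each survives to f with the conditional
    # probability (1 - G(f - b)) / (1 - G(s - b)).
    old = sum((1.0 - G(f - b)) / (1.0 - G(s - b)) for b in birth_times)
    # Tuples inserted in (s, f]: trapezoid rule for lam(t) * (1 - G(f - t)).
    h = (f - s) / steps
    vals = []
    for i in range(steps + 1):
        t = s + i * h
        age_at_f = max(0.0, f - t)  # guard against tiny negative round-off
        vals.append(lam(t) * (1.0 - G(age_at_f)))
    new = h * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])
    return old + mean_batch * new
```

With an exponential `G` the estimate should agree with the memoryless formulas of the previous section, which gives a convenient sanity check.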

2.5 Summary 

In this section, we have provided a model for the insertion and deletion of tuples in a relational 
database. The immediate benefit of this model is the computation of the expected relation cardinality
($\mathrm{E}[|R(f)|]$), given an initial cardinality and insertion and tuple life span parameters. Relation
cardinality has proven to be an important property in many database tools, including query opti- 
mization and database tuning. Reasonable assumptions regarding constant multiplicity allow, once 
appropriate statistics have been gathered, a rapid computation of cardinalities in this framework. 
Section H elaborates on statistics gathering and model validation. 

A note regarding tuples with non-exponential life spans is now warranted. For the case of a 
single relation, non-exponential life spans add only a moderate amount of complexity to our model, 
namely the requirement to store at least an approximation of the distribution of tuple ages in $R(s)$.
For multiple relations with referential integrity constraints, however, the complexity of dealing with
general tuple life spans is much greater. First, to estimate the cardinality of $R(f)$, we must keep
(approximate) tuple age distributions for all relations in $S(R)$. Second, because the tuple life span
distributions of some of the members of $S(R)$ are not memoryless, we cannot combine them with
a simple relation like (|l]). Furthermore, in attempting to find the distribution of the remaining life 
span of a particular tuple $r \in R(s)$, it may become necessary to consider the issue of the correlation
of ages of tuples in $R(s)$ with the ages of the corresponding tuples in other relations of $S(R)$.
Because of these complications, we defer further consideration of non-exponential tuple life spans 
to future research.

3 Modeling data modification

This section describes various ways to model the modification of the contents of tuples. We start 
with a general approach, using Markov chains, followed by several special cases where the amount 
of computation can be greatly reduced. 

3.1 Content-dependent updates 

In this section, we model the modification of the contents of tuples as a finite-state continuous-time
Markov chain, thus assuming dependence on tuples' previous contents. For each relation $R$, we
allow for some (possibly empty) subset $C(R) \subseteq A(R)$ of its attributes to be subject to change over
the lifetime of a tuple. We do not permit primary key fields to be modified, that is, $C(R) \cap K(R) = \emptyset$.

Attribute values may change at time instants called transition events, which are the transition
times of the Markov chain. We assume that the spacing of transition events is memoryless with
respect to the age of a tuple (although it may depend on the time and the current value of the
attribute, as demonstrated below). For any attribute $A$, tuple $r$, time $s$, and value $v \in \mathrm{dom}\,A$ with
$r.A(s) = v$, the time remaining until the next transition event for $r.A$ is a random variable $T^{R,A}_{v,s}$
with the distribution $\mathrm{Exp}_s(\ell^{R,A}_v\, \gamma_{R,A}(\cdot))$, where $\gamma_{R,A} : \mathbb{R} \to [0,\infty)$ is a function giving the general
instantaneous rate of change for the attribute, and $\ell^{R,A}_v$ is a nonnegative scalar which we call the
relative exit rate of $v$. We define $\Gamma_{R,A}(s,f) = \int_s^f \gamma_{R,A}(t)\, dt$. When a transition event occurs from
state $u \in \mathrm{dom}\,A$, attribute $A$ changes to $v \in \mathrm{dom}\,A$ with probability $P^{R,A}_{u,v}$.

Suppose $\mathcal{A} = \{A_1, A_2, \ldots, A_k\}$ is an independently varying set of attributes of $R$, and $v =
(v_1, v_2, \ldots, v_k) \in \mathrm{dom}\,\mathcal{A}$ is a compound value. Then the time until the next transition event for
$r.\mathcal{A}$ is $T^{R,\mathcal{A}}_{v,s} = \min\{T^{R,A_1}_{v_1,s}, \ldots, T^{R,A_k}_{v_k,s}\}$. As a rule, we will assume that the modification processes
for the attributes of a relation are independent, so $T^{R,\mathcal{A}}_{v,s} \sim \mathrm{Exp}_s\!\left(\sum_{i=1}^{k} \ell^{R,A_i}_{v_i}\, \gamma_{R,A_i}(\cdot)\right)$. When the
functions $\gamma_{R,A_i}$ are identical for $i = 1, \ldots, k$, we define $\gamma_{R,\mathcal{A}} = \gamma_{R,A_1}$ and $\ell^{R,\mathcal{A}}_v = \sum_{i=1}^{k} \ell^{R,A_i}_{v_i}$, so
$T^{R,\mathcal{A}}_{v,s} \sim \mathrm{Exp}_s(\ell^{R,\mathcal{A}}_v\, \gamma_{R,\mathcal{A}}(\cdot))$. To justify the assumption of independence, we note that coordinated
modifications among attributes can be modeled by replacing the coordinated attributes with a single
compound attribute (this technique requires that the attributes have identical $\gamma_{R,A}(\cdot)$ functions,
which is reasonable if they change in a coordinated way).

Under these assumptions, let C{R) denote a partition of C{R) into subsets A such that any two 
attributes Ai,A2 G C{R) vary dependently iff they are in the same A G C{R). 

Example 6 (First alteration time) For a relation $R$ and time $s$, we define $T_{R,s}$ to be the
amount of time until the next change in $R$, be it a tuple insertion, a tuple deletion, or an attribute
modification. Also, for any $S \in S(R)$, let $D(R,S,s)$ denote the number of tuples in $S(s)$
whose deletion would force the deletion of some tuple in $R(s)$ (and therefore $D(R,R,s) = |R(s)|$).
The following proposition establishes the distribution of $T_{R,s}$.

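Proposition 4 $T_{R,s} \sim \mathrm{Exp}_s(\zeta_R(\cdot))$, where

$$\zeta_R(t) = \lambda_R(t) + \sum_{S \in S(R)} D(R,S,s)\, \mu_S(t) + \sum_{\mathcal{A} \in C(R)} h(R,\mathcal{A},s)\, \gamma_{R,\mathcal{A}}(t)$$

and $h(R,\mathcal{A},s) = \sum_{v \in \mathrm{dom}\,\mathcal{A}} R_{\mathcal{A},v}(s)\, \ell^{R,\mathcal{A}}_v$. The probability of any alteration to $R$ in the time interval
$(s,f]$ is $1 - e^{-Z_R(s,f)}$, where

$$Z_R(s,f) = \Lambda_R(s,f) + \sum_{S \in S(R)} D(R,S,s)\, M_S(s,f) + \sum_{\mathcal{A} \in C(R)} h(R,\mathcal{A},s)\, \Gamma_{R,\mathcal{A}}(s,f).$$

Proof. Let $T^{I}_{R,s}$, $T^{M}_{R,s}$, and $T^{D}_{R,s}$ be the times until the next insertion, modification, and deletion
in $R$, respectively. From Section |^, we have that $T^{I}_{R,s} \sim \mathrm{Exp}_s(\lambda_R(\cdot))$. Now, for each $S \in S(R)$,
there are $D(R,S,s)$ tuples whose deletion would cause a deletion in $R$. The time until deletion of
any such $r \in S \in S(R)$ is distributed like $\mathrm{Exp}_s(\mu_S(\cdot))$. The deletion processes for all these tuples
are independent across all of $S(R)$, so we can use (||) to conclude that

$$T^{D}_{R,s} \sim \mathrm{Exp}_s\!\left(\sum_{S \in S(R)} D(R,S,s)\, \mu_S(\cdot)\right).$$

From the preceding discussion, we have

$$T^{M}_{R,s} = \min_{r \in R(s),\ \mathcal{A} \in C(R)} T^{R,\mathcal{A}}_{r.\mathcal{A}(s),\,s} \sim \mathrm{Exp}_s\!\left(\sum_{\mathcal{A} \in C(R)} \sum_{r \in R(s)} \ell^{R,\mathcal{A}}_{r.\mathcal{A}(s)}\, \gamma_{R,\mathcal{A}}(\cdot)\right).$$

Since $T_{R,s} = \min\{T^{I}_{R,s}, T^{M}_{R,s}, T^{D}_{R,s}\}$, we therefore have, again using independence and (||), that
$T_{R,s} \sim \mathrm{Exp}_s(\zeta_R(\cdot))$, where

$$\begin{aligned}
\zeta_R(t) &= \lambda_R(t) + \sum_{S \in S(R)} D(R,S,s)\, \mu_S(t) + \sum_{\mathcal{A} \in C(R)} \sum_{r \in R(s)} \ell^{R,\mathcal{A}}_{r.\mathcal{A}(s)}\, \gamma_{R,\mathcal{A}}(t) \\
&= \lambda_R(t) + \sum_{S \in S(R)} D(R,S,s)\, \mu_S(t) + \sum_{\mathcal{A} \in C(R)} h(R,\mathcal{A},s)\, \gamma_{R,\mathcal{A}}(t).
\end{aligned}$$

Integrating over $(s,f]$ results in

$$\begin{aligned}
Z_R(s,f) &= \int_s^f \zeta_R(t)\, dt \\
&= \Lambda_R(s,f) + \sum_{S \in S(R)} D(R,S,s)\, M_S(s,f) + \sum_{\mathcal{A} \in C(R)} h(R,\mathcal{A},s)\, \Gamma_{R,\mathcal{A}}(s,f).
\end{aligned}$$

Therefore, the probability of any alteration to $R$ in the time interval $(s,f]$ is $1 - e^{-Z_R(s,f)}$. ■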

Example 6 continued (First alteration transcription policy) Suppose the user wishes to
refresh her replica of relation $R$ whenever the probability that it contains any inaccuracy exceeds
some threshold $\pi$, a tactic we call the first alteration policy. Then, a refresh is required at time $f$
if $1 - e^{-Z_R(s,f)} > \pi$. □
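
To make the first alteration policy concrete, here is a minimal sketch for the special case in which every rate is constant over time, so that the cumulative alteration intensity grows linearly with the interval length. The constant-rate restriction and all function names are assumptions made for this illustration; with time-varying rates the deadline would have to be found numerically.

```python
import math

def total_alteration_rate(lam, deletions, modifications):
    """Constant total rate: insertion rate, plus D * mu for each relation
    whose deletions propagate to R, plus h * gamma for each attribute group.

    deletions     : iterable of (D, mu) pairs
    modifications : iterable of (h, gamma) pairs
    """
    return (lam
            + sum(d * mu for d, mu in deletions)
            + sum(h * g for h, g in modifications))

def alteration_probability(zeta, s, f):
    """Probability of at least one alteration in (s, f] for constant zeta."""
    return 1.0 - math.exp(-zeta * (f - s))

def refresh_deadline(zeta, s, pi):
    """Time at which the alteration probability reaches the threshold pi,
    assuming zeta > 0 and 0 < pi < 1."""
    return s - math.log(1.0 - pi) / zeta
```

A replica synchronized at time `s` would then be refreshed no later than `refresh_deadline(zeta, s, pi)`.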

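Given $\mathcal{A}$, we describe the transition process for the value $r.\mathcal{A}$ over time by probabilities

$$P^{R,\mathcal{A}}_{u,v}(s,f) = P\{r.\mathcal{A}(f) = v \mid r.\mathcal{A}(s) = u\},$$

for any two values $u, v \in \mathrm{dom}\,\mathcal{A}$ and times $s < f$. Under the assumption of independence,

$$P^{R,\mathcal{A}}_{u,v}(s,f) = \prod_{i=1}^{k} P^{R,A_i}_{u_i,v_i}(s,f). \qquad (4)$$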



Given any simple attribute $A$, we define $q^{R,A}_{u,v}$, the relative transition rate from $u$ to $v$, by

$$q^{R,A}_{u,v} = \ell^{R,A}_{u}\, P^{R,A}_{u,v}.$$

Given a set of attributes $\mathcal{A}$ with identical $\gamma_{R,A_i}(\cdot)$ functions, the compound transition rate $q^{R,\mathcal{A}}_{u,v}$ may
be computed via



„R,A ^ fR,A pR,A 



Let Q^'^ be the matrix of qu,'v^, where g^'tf 

Proposition 5 The matrix $P^{R,A}(s,f)$ of elements $P^{R,A}_{u,v}(s,f)$ is given by the matrix exponential
formula

$$P^{R,A}(s,f) = \exp\!\big(\Gamma_{R,A}(s,f)\, Q^{R,A}\big) = \sum_{n=0}^{\infty} \frac{(\Gamma_{R,A}(s,f))^n}{n!}\, (Q^{R,A})^n. \qquad (6)$$

Proof. Consider a continuous-time Markov chain on the same state space $\mathrm{dom}\,A$, and with the
same instantaneous transition probabilities $P^{R,A}_{u,v}$, where $u, v \in \mathrm{dom}\,A$. However, in the new chain,
the holding time in each state $v$ is simply a homogeneous exponential random variable with arrival
rate $\ell^{R,A}_v$. We call this system the linear-time chain, to distinguish it from the original chain. Define
$\bar{P}^{R,A}_{u,v}(t)$ to be the chance that the linear-time chain is in state $v$ at time $t$, given that it is in state
$u$ at time 0. Standard results for finite-state continuous-time Markov chains imply that

$$\bar{P}^{R,A}(t) = \exp(t\, Q^{R,A}) = \sum_{n=0}^{\infty} \frac{t^n}{n!}\, (Q^{R,A})^n.$$

By a transformation of the time variable, we then assert that

$$P^{R,A}(s,f) = \bar{P}^{R,A}(\Gamma_{R,A}(s,f)),$$

from which the result follows. ■
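
The series in (6) can be summed term by term. The sketch below is a plain-Python illustration under the assumption that the product of the cumulative rate and the largest entry of the rate matrix is modest; the helper names are hypothetical. For large arguments the naive series loses accuracy, and a library routine such as `scipy.linalg.expm` would be the more robust choice.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def transition_matrix(Q, gamma, tol=1e-12, max_terms=200):
    """Approximate exp(gamma * Q) by a truncated power series.

    Q     : square rate matrix (rows sum to zero) as a list of rows
    gamma : cumulative rate over the interval, assumed precomputed
    """
    n = len(Q)
    A = [[gamma * q for q in row] for row in Q]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for k in range(1, max_terms):
        # term now holds A^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        total = [[t + x for t, x in zip(tr, xr)]
                 for tr, xr in zip(total, term)]
        # Stop once the newest term is negligible (heuristic cutoff).
        if max(abs(x) for row in term for x in row) < tol:
            break
    return total
```

Each row of the result should sum to one, and for a two-state chain the entries can be compared with the standard closed form.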

Example 7 (Query optimization, revisited) As the following proposition shows, our model
can be used to estimate the histogram of a relation $R$ at time $f$. A query optimizer running at time
$f$ could use expected histograms, calculated in this manner, instead of the old histograms $R_A(s)$.



Proposition 6 Assume that $w(r,S)$, for all $S \in S(R)$, is independent of the attribute values $r.A(s)$
for all $A \in C(R)$. Let $\omega^{R,A}_u(t)$ denote the probability that $r.A(t) = u$, given that $r$ is inserted into $R$
at time $t$. Then, for all $v \in \mathrm{dom}\,A$,

$$\begin{aligned}
\mathrm{E}[R_{A,v}(f)] ={}& p_R(s,f) \sum_{u \in \mathrm{dom}\,A} R_{A,u}(s)\, P^{R,A}_{u,v}(s,f) \\
&+ \mathrm{E}[\Delta^+] \sum_{u \in \mathrm{dom}\,A} \left(\int_s^f \omega^{R,A}_u(t)\, \hat{p}_R(t,f)\, \lambda_R(t)\, P^{R,A}_{u,v}(t,f)\, dt\right). \qquad (7)
\end{aligned}$$

Proof. We first compute the expected number of surviving tuples $r$ whose values $r.A$ migrate to $v$.
Given a value $u \in \mathrm{dom}\,A$, there are $R_{A,u}(s)$ tuples at time $s$ such that $r.A(s) = u$. Using the previous
results, the expected number of these tuples surviving through time $f$ is $R_{A,u}(s)\, p_R(s,f)$, and the
probability of each surviving tuple $r$ having $r.A(f) = v$ is $P^{R,A}_{u,v}(s,f)$. Using the independence
assumption and summing over all $u \in \mathrm{dom}\,A$, one has that the expected number of tuples in $R(s)$
that survive through $f$ and have $r.A(f) = v$ is

$$\sum_{u \in \mathrm{dom}\,A} R_{A,u}(s)\, p_R(s,f)\, P^{R,A}_{u,v}(s,f) = p_R(s,f) \sum_{u \in \mathrm{dom}\,A} R_{A,u}(s)\, P^{R,A}_{u,v}(s,f).$$

We next consider newly inserted tuples. Recall that $\omega^{R,A}_u(t)$ denotes the probability that $r.A(t) = u$,
given that $r$ is inserted into $R$ at time $t$. Suppose that an insertion occurs at time $t \in (s,f]$. The
probability that a tuple $r$ created at this insertion both survives until $f$ and has $r.A(f) = v$
is

$$\hat{p}_R(t,f) \sum_{u \in \mathrm{dom}\,A} \omega^{R,A}_u(t)\, P^{R,A}_{u,v}(t,f).$$

By logic similar to Proposition 5, one may then conclude that the expected number of newly inserted
tuples that survive through time $f$ and have $r.A(f) = v$ is

$$\mathrm{E}[\Delta^+] \sum_{u \in \mathrm{dom}\,A} \left(\int_s^f \omega^{R,A}_u(t)\, \hat{p}_R(t,f)\, \lambda_R(t)\, P^{R,A}_{u,v}(t,f)\, dt\right).$$

The result follows by adding the last two expressions. ■



Example 7 continued We next consider whether the complexity of calculating (7) is preferable
to recomputing the histogram vector $R_A(f)$. This topic is quite involved and depends heavily on the
specific structure of the database (e.g., the availability of indices) and the specific application (e.g.,
the concentration of values in a small subset of an attribute's domain). In what follows, we lay
out some qualitative considerations in deciding whether calculating (7) would be more efficient than
recalculating $R_A(f)$ "from scratch." Experimentation with real-world applications is left for further
research.

Generally speaking, direct computation of the histogram of an attribute $A$ (in the absence of
an index for $A$) can be done by either scanning all tuples (although sampling may also be used) or
scanning a modification log to capture changes to the prior histogram vector $R_A(s)$ during $(s,f]$.
Therefore, the recomputation can be performed in $O(\min\{|R(f)|, T(s,f)\})$ time, where $T(s,f)$
denotes the total number of updates during $(s,f]$. Whenever $|R(f)|$ and $T(s,f)$ are both large,
i.e., the database is large and the transaction load is high, the straightforward techniques will be
relatively unattractive. As for the estimation technique, it will probably work best when $|\mathrm{dom}\,A|$ is
small (for example, for a binary attribute) or whenever the subset of actually utilized values in the
domain is small. In addition, commercial databases recompute the entire histogram as a single,
atomic task. Formula (7), on the other hand, can be evaluated on a subset of the attribute values.
For example, in the case of exact matching (say, a condition of the form WHERE A = v), it is
sufficient to compute $R_{A,v}(f)$, rather than the full $R_A(f)$ vector. Finally, it is worth noting that
computing the expected value of $R_{A,v}(f)$ via (7) does not require locking $R$, while a full histogram
recomputation may involve extended periods of locking. □

We next consider the number of tuples in $R(s)$ that have survived through time $f$ without being
modified, which we denote $Y^{-}_R(s,f)$. The expectation of this random variable is

$$\mathrm{E}[Y^{-}_R(s,f)] = p_R(s,f) \sum_{v \in \mathrm{dom}\,A} R_{A,v}(s)\, P^{R,A}_{v,v}(s,f).$$

We let $Y^{+}_R(s,f) = Y_R(s,f) - Y^{-}_R(s,f)$ denote the number of tuples in $R(s)$ that have survived
through time $f$ and were modified; it follows from the linearity of the $\mathrm{E}[\cdot]$ operator that

$$\mathrm{E}[Y^{+}_R(s,f)] = \mathrm{E}[Y_R(s,f)] - \mathrm{E}[Y^{-}_R(s,f)].$$

3.1.1 Complexity analysis of content-dependent updates 

In practice, as with computing a scalar exponential, only a limited number of terms will be needed
to compute the sum (6) to machine precision. It is worth noting that efficient means of calculating
(y) are a major topic in the field of computational probability. 

In the case of a compound attribute $\mathcal{A} = \{A_1, A_2, \ldots, A_k\}$ with independently varying components,
it will be computationally more efficient to first calculate the individual transition probability
matrices $P^{R,A_i}(s,f)$ via (6), and then calculate the joint probability matrix $P^{R,\mathcal{A}}(s,f)$ using (4),
rather than first finding the joint exit rate matrix $Q^{R,\mathcal{A}}$ via (5) and then applying (6). The former
approach would involve repeated multiplications of square matrices of size $|\mathrm{dom}\,A_i|$, for $i = 1, \ldots, k$,
resulting in a computational complexity of $O\!\left(\sum_{i=1}^{k} n_i\, |\mathrm{dom}\,A_i|^{\nu}\right)$, where $n_i$ is the number of iterations
needed to compute the sum (6) to machine precision, and the complexity of multiplying two
$n \times n$ matrices is $\Theta(n^{\nu})$. The latter would involve multiplying square matrices of size $\prod_{i=1}^{k} |\mathrm{dom}\,A_i|$,
resulting in the considerably worse complexity of $O\!\left(n \left(\prod_{i=1}^{k} |\mathrm{dom}\,A_i|\right)^{\nu}\right)$, where $n$ is the number of
iterations needed to obtain the desired precision.

3.2 Simplified modification models 

We next introduce several possible simplifications of the general Markov chain case. To do so, we 
start by differentiating numeric domains from non-numeric domains. Certain database attributes 
A £ A, such as prices and order quantities, represent numbers, and numeric operations such as 
addition are meaningful for these attributes. For such attributes, one can easily define a distance 
function between two attribute values, as we shall see below. We call the domains dom^ of such 
attributes numeric domains, and denote the set of all attributes with numeric domains hy M C B. 
All other attributes and domains are considered non-numeric^ It is worth noting that not all 
numeric data necessarily constitute a numeric domain. Consider, for example, a customer relation 
R whose primary key is a customer number. Although the customer number consists of numeric 
symbols, it is essentially an arbitrary identification string for which arithmetic operations like 
addition and subtraction are not intrinsically meaningful for the database application. We consider 
such attributes to be non-numeric. 

3.2.1 Domain lumping 

To make our data modification model more computationally tractable, it may be appropriate, in
many cases, to simplify the Markov chain state space for an attribute $A$ so that it is much smaller
than $\mathrm{dom}\,A$. Suppose, for example, that $A$ is a 64-character string representing a street address.
Restricting to 96 printable characters, $A$ may assume on the order of $96^{64} \approx 10^{127}$ possible values.
It is obviously unnecessary, inappropriate, and intractable to work with a Markov chain with such
an astronomical number of states.

^ $\nu = 3$ for the standard method and $\nu = \log_2 7$ for Strassen's and related methods.

^ Distance metrics can also be defined for complex data types such as images. We leave the handling of such cases
to further research.

One possible remedy for such situations is referred to as lumping in the Markov chain literature [17].
In our terminology, suppose we can partition $\mathrm{dom}\,A$ into a collection of sets $\mathcal{V}$ with
the property that $|\mathcal{V}| \ll |\mathrm{dom}\,A|$ and

$$\sum_{v \in V} q^{R,A}_{u,v} = \sum_{v \in V} q^{R,A}_{u',v} \quad \text{for all } U, V \in \mathcal{V},\ U \neq V, \text{ and all } u, u' \in U.$$

Then, one can model the transitions between the "lumps" $V \in \mathcal{V}$ as a much smaller Markov chain
whose set of states is $\mathcal{V}$, with the transition rate from $U \in \mathcal{V}$ to $V \in \mathcal{V}$ being given by the common
value of $\sum_{v \in V} q^{R,A}_{u,v}$, $u \in U$. If we are interested only in which lump the attribute is in, rather than
its precise value, this smaller chain will suffice. Using lumping, the complexity of the computation
is directly dependent on the number of lumps. We now give a few simple examples:

Example 8 (Lumping into a binary domain) Consider the street address example just discussed.
Fortunately, if an address has changed since time $s$, the database user is unlikely to be
concerned with how different it is from the address at time $s$, but simply whether it is different.
Thus, instead of modeling the full domain $\mathrm{dom}\,A$, we can represent the domain via the simple binary
set $\{0, 1\}$, where 0 indicates that the address has not changed since time $s$, and 1 indicates
that it has. We assume that the exit rates $q^{R,A}_{v,\,r.A(s)}$ from all other addresses $v \in \mathrm{dom}\,A$ back to the
original value $r.A(s)$ all have the identical value $\theta'$. In this case, one has $P^{R,A}_{0,1} = P^{R,A}_{1,0} = 1$, and
the behavior of the attribute is fully captured by the exit rates $\ell^{R,A}_0 = q^{R,A}_{0,1}$ and $\ell^{R,A}_1 = q^{R,A}_{1,0}$. We
will abbreviate these quantities by $\theta$ and $\theta'$, respectively.

Using standard results for a two-state continuous-time Markov chain [35, Section VI.3.3], we
conclude that

$$P^{R,A}_{0,0}(s,f) = \frac{\theta' + \theta\, e^{-(\theta + \theta')\,\Gamma_{R,A}(s,f)}}{\theta + \theta'}, \qquad (8)$$

$$P^{R,A}_{0,1}(s,f) = \frac{\theta - \theta\, e^{-(\theta + \theta')\,\Gamma_{R,A}(s,f)}}{\theta + \theta'}. \qquad (9)$$



Example 9 (Web crawling) As an even simpler special case, consider a Web crawler (e.g., 
^dj , \l^ , ^). Such a crawler needs to visit Web pages upon change to re-process their content, 
possibly for the use of a search engine. Recalling Example 8, one may define a boolean attribute
Modified in a relation that collects information on Web pages. Modified is set to True once the
page has changed, and back to False once the Web crawler has visited the page. Therefore, once a
page has been modified to True, it cannot be modified back to False before the next visit of the Web
crawler. In the analysis of Example 8, one can set $\theta' = 0$, resulting in $P^{R,A}_{0,0}(s,f) = e^{-\theta\,\Gamma_{R,A}(s,f)}$
and $P^{R,A}_{0,1}(s,f) = 1 - e^{-\theta\,\Gamma_{R,A}(s,f)}$. □
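
The binary lumped chain and its Web-crawling special case reduce to a few lines of arithmetic. The sketch below uses the standard two-state continuous-time Markov chain solution; the function names are hypothetical, and the cumulative rate over the interval is assumed to have been computed beforehand and is passed in as `gamma`.

```python
import math

def two_state_probs(theta, theta_prime, gamma):
    """Return (P00, P01) for the binary lumped chain.

    theta       : relative rate of leaving the original value (state 0)
    theta_prime : relative rate of returning to it (state 1 -> 0)
    gamma       : cumulative rate of change over (s, f]
    Assumes theta + theta_prime > 0.
    """
    total = theta + theta_prime
    decay = math.exp(-total * gamma)
    p00 = (theta_prime + theta * decay) / total
    return p00, 1.0 - p00

def crawler_change_probability(theta, gamma):
    """Probability that a page has changed since the last crawl, i.e. the
    special case with no return to the unmodified state."""
    return 1.0 - math.exp(-theta * gamma)
```

Setting `theta_prime` to zero in `two_state_probs` reproduces `crawler_change_probability`, mirroring the reduction described in the example.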

3.2.2 Random walks 

Like large non-numeric domains, many numeric domains may also be cumbersome to model directly
via Markov chain techniques. For example, a 32-bit integer attribute can, in theory, take $2^{32} \approx
4 \times 10^{9}$ distinct values, and it would be virtually impossible to directly form, much less exponentiate,
a full transition rate matrix for a Markov chain of this size.

Fortunately, it is likely that such attributes will have "structured" value transition patterns that 
can be modeled, or at least closely approximated, in a tractable way. As an example, we consider 
here a random walk model for numeric attributes. 

In this case, we still suppose that the attribute $A$ is modified only at transition event times that
are distributed as described above. Letting $t_i$ denote the time of transition event $i$, with $t_0 = s$, we
suppose that at transition event $i$, the value of attribute $A$ is modified according to

$$r.A(t_i) = r.A(t_{i-1}) + \Delta A_i,$$

where $\Delta A_i$ is a random variable. We suppose that the random variables $\{\Delta A_i\}$ are IID, that is,
they are independent and share a common distribution with mean $\delta$ and variance $\sigma^2$. Defining

$$\Delta A(s,f) = \sum_{i \ge 1:\; t_i \le f} \Delta A_i,$$

we obtain that $\{\Delta A(s,f), f \ge s\}$ is a nonhomogeneous compound Poisson process, and $r.A(f) =
r.A(s) + \Delta A(s,f)$. From standard results for compound Poisson processes, we then obtain for each
tuple $r \in R(s)$ that $\mathrm{E}[r.A(f)] = r.A(s) + \Gamma_{R,A}(s,f)\,\delta$.

It should be stressed that such a model must ultimately be only an approximation, since a
random walk model of this kind would, strictly speaking, require an infinite number of possible
states, while $\mathrm{dom}\,A$ is necessarily finite for any real database. However, we still expect it to be
accurate and useful in many situations, such as when $r.A(s)$ and $\mathrm{E}[r.A(f)]$ are both far from the largest
and smallest possible values in $\mathrm{dom}\,A$.

3.2.3 Content-independent overwrites 

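Consider the simple case in which $P^{R,\mathcal{A}}_{u,v}(s,f)$ is independent of $u$ once a transition event has
occurred. Let $\mathcal{A} \subseteq C(R)$ be a set of attributes $A$ with identical $\gamma_{R,A}$ functions, and let $\Gamma_{R,\mathcal{A}}(s,t) =
\Gamma_{R,A}(s,t)$ for any $A \in \mathcal{A}$. We define a probability distribution $\omega_{R,\mathcal{A}}$ over $\mathrm{dom}\,\mathcal{A}$, and assume that
at each transition event, a new value for $\mathcal{A}$ is selected at random from this distribution, without
regard to the prior value of $r.\mathcal{A}$. It is thus possible that a transition event will leave $r.\mathcal{A}$ unchanged,
since the value selected may be the same one already stored in $r$. For any tuple $r \in R(s) \cap R(f)$ and
$u \in \mathrm{dom}\,\mathcal{A}$, we thus compute the probability $P^{R,\mathcal{A}}_{u,u}(s,f)$ that the value of $r.\mathcal{A}$ remains unchanged
at $u$ at time $f$ to be

$$\begin{aligned}
P^{R,\mathcal{A}}_{u,u}(s,f) &= P\{T^{R,\mathcal{A}}_{u,s} > f - s\} + P\{T^{R,\mathcal{A}}_{u,s} \le f - s\}\, \omega_{R,\mathcal{A}}(u) \\
&= e^{-\ell^{R,\mathcal{A}}_u\, \Gamma_{R,\mathcal{A}}(s,f)}\, \big(1 - \omega_{R,\mathcal{A}}(u)\big) + \omega_{R,\mathcal{A}}(u).
\end{aligned}$$

For $u, v \in \mathrm{dom}\,\mathcal{A}$ such that $u \neq v$, we also compute the probability $P^{R,\mathcal{A}}_{u,v}(s,f)$ that $r.\mathcal{A}$ changes
from $u$ to $v$ in $[s,f)$ to be

$$P^{R,\mathcal{A}}_{u,v}(s,f) = P\{T^{R,\mathcal{A}}_{u,s} \le f - s\}\, \omega_{R,\mathcal{A}}(v) = \left(1 - e^{-\ell^{R,\mathcal{A}}_u\, \Gamma_{R,\mathcal{A}}(s,f)}\right) \omega_{R,\mathcal{A}}(v).$$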
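
The two overwrite formulas translate directly into code. The following sketch builds one row of the transition matrix; the function name and the dictionary representation of the overwrite distribution are assumptions made purely for illustration.

```python
import math

def overwrite_transition_row(u, omega, exit_rate, gamma):
    """Transition probabilities out of value u under content-independent
    overwrites.

    u         : value held at time s (must be a key of omega)
    omega     : dict mapping each domain value to its overwrite probability
    exit_rate : relative exit rate of u
    gamma     : cumulative rate of change over the interval
    """
    no_event = math.exp(-exit_rate * gamma)
    # If at least one transition event occurred, the final value is a fresh
    # draw from omega, whatever the previous value was.
    row = {v: (1.0 - no_event) * w for v, w in omega.items()}
    # If no event occurred, the value is still u.
    row[u] += no_event
    return row
```

Because the work per row is linear in the number of domain values, no matrix exponentiation is required in this special case.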

Content-independent overwrites are a special case of the Markov chain model discussed above.
To apply the general model formulae when content-independent updates are present, each $\ell^{R,\mathcal{A}}_v$ is
multiplied by $1 - \omega_{R,\mathcal{A}}(v)$ and $P^{R,\mathcal{A}}_{u,v} = \omega_{R,\mathcal{A}}(v)/(1 - \omega_{R,\mathcal{A}}(u))$ for all $u, v \in \mathrm{dom}\,\mathcal{A}$, $u \neq v$.

3.3 Summary

In this section we have introduced a general Markov-chain model for data modification, and
discussed three simplified models that allow tractable computation. Using these models, one can
compute, in probabilistic terms, value histograms at time $f$, given a known initial set of value
histograms at time $s < f$. Such a model could be useful in query optimization, whenever the
continual gathering of statistics becomes impossible due to either heavy system loads or structural
constraints (e.g., federations of databases with autonomous DBMSs).

Generally speaking, computing the transition matrix for an attribute $A$ involves repeated
multiplications of square matrices of size $|\mathrm{dom}\,A|$, resulting in a computational complexity of
$O(n\,|\mathrm{dom}\,A|^{\nu})$, where $n$ is the number of iterations needed to compute the sum (6) to machine
precision. While $n$ is usually small, $|\mathrm{dom}\,A|$ may be very large, as demonstrated in Section 3.2.1 and
Section 3.2.2. Methods such as domain lumping would require $\Theta(n X^{\nu})$ time, where $X \ll |\mathrm{dom}\,A|$.

As for random walks and independent updates, both methods no longer require repeated matrix
multiplications, but rather the computation of $\Gamma_{R,A}(s,f)$. The complexity of calculating $\Gamma_{R,A}(s,f)$
is similar to that for $\Lambda_R(s,f)$ in Section 2.1.1.

4 Insertion model verification 

It is well-known that Poisson processes model a world where data updates are independent from 
one another. While in databases with widely distributed access, e.g., incoming e-mails, postings to 
newsgroups, or posting of orders from independent customers, such an independence assumption 
seems plausible, we still need to validate the model against real data. In this section we shall present 
some initial experiments as a "proof of concept." These experiments deal only with the insertion 
component of the model. Further experiments, including modification and deletion operations, will 
be reported in future work. 

Our data set is taken from postings to the DBWORLD electronic bulletin board. The data
were collected over more than seven months and consist of about 750 insertions, from November
9th, 2000 through May 14th, 2001. Figure 2 illustrates a data set with 580 insertions during the
interval [2000/11/9:00:00:00, 2001/3/31:00:00:00). We used the Figure 2 data as a training set, i.e.,
it serves as our basis for parameter estimation. Later, in order to test the model, we applied these 
parameters to a separate testing set covering the period [2001/3/31:00:00:00, 2001/5/15:00:00:00). 
In the experiments described below, we tried fitting the training data with two insertion-only 



models, namely a homogeneous Poisson process and an RPC Poisson process (see Section 2.1). 
For each of these two models, we have applied two variations, either as a compound or as a non- 
compound model. In the experiments described below, we have used the Kolmogorov-Smirnov 
goodness of fit test (see for example [p!6| . Section 7.7]). For completeness, we first overview the 
principles of this statistical test. 

The Kolmogorov-Smirnov test evaluates the likelihood of a null hypothesis that a given sample
may have been drawn from some hypothesized distribution. If the null hypothesis is true, and
the sample has indeed been drawn from the hypothesized distribution, then the empirical cumulative
distribution of the sample should be close to its theoretical counterpart. If the sample cumulative
distribution is too far from the hypothesized distribution at any point, that suggests that the sample
comes from a different distribution. Formally, suppose that the theoretical distribution is $F(x)$,
and we have $n$ sample values $x_1, \ldots, x_n$ in nondecreasing order. We define an empirical
cumulative distribution $F_n(x)$ via

$$F_n(x) = \begin{cases} 0, & \text{if } x < x_1 \\ k/n, & \text{if } x_k \le x < x_{k+1} \\ 1, & \text{if } x \ge x_n \end{cases}$$

and then compute $D_n = \sup_{k=1,\ldots,n} \{|F_n(x_k) - F(x_k)|\}$. For large $n$, given a significance
level $\alpha$, the test measures $D_n$ against $\lambda(\alpha)/\sqrt{n}$, where $\lambda(\alpha)$ is a factor depending on the
significance level $\alpha$ at which we reject the null hypothesis; the hypothesis is rejected when $D_n$
exceeds this threshold. For example, $\lambda(0.05) = 1.36$ and $\lambda(0.1) = 1.22$. The value of $\alpha$ is the
probability of a false rejection (a Type I error), that is, the chance that the null hypothesis might
be rejected when it is actually true. Larger values of $\alpha$ make the test harder to pass.

(Footnote: Here, n may also be affected by the change of domain.)

Figure 2: Training data set.
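
As a rough illustration of the test just described, the following Python sketch computes a Kolmogorov-Smirnov statistic and the large-sample rejection threshold. The function names are hypothetical, and one detail is an assumption on our part: the statistic is evaluated on both sides of each jump of the empirical distribution (the usual textbook form), which can only give a value at least as large as evaluating at the sample points alone.

```python
import math


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`.

    Assumption: the gap is checked just before and just after each jump
    of the empirical distribution function.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        fx = cdf(x)
        d = max(d, abs(k / n - fx), abs((k - 1) / n - fx))
    return d


def ks_threshold(n, factor):
    """Large-sample rejection threshold: factor / sqrt(n)."""
    return factor / math.sqrt(n)


def exponential_cdf(rate):
    """CDF of an exponential distribution with the given rate."""
    return lambda x: (1.0 - math.exp(-rate * x)) if x > 0 else 0.0
```

With the factor 1.22 quoted above for a significance level of 0.1, `ks_threshold(557, 1.22)` rounds to 0.0517, the threshold used later for the compound models.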

4.1 Fitting the homogeneous Poisson process 

Based on the training set, we computed the parameter for a homogeneous Poisson process by
averaging the 580 interarrival times, an unbiased estimator of the mean interarrival time $1/\lambda$. The
average interarrival time was computed to be 5:15:19, and thus $\lambda = 4.57$ per day. Figure 3(a)
provides a pictorial comparison of the cumulative distribution function of the interarrival times
with its theoretical counterpart. We applied the Kolmogorov-Smirnov test to the distribution of
interarrival times, comparing it with an exponential distribution with a parameter of $\lambda = 4.57$. The
outcome of the test is $D_n = 0.106$, which means we can reject the null hypothesis at any reasonable
significance level $\alpha \ge 0.005$ (for $\alpha = 0.005$, the rejection threshold is 0.0718 for $n = 580$). In all
likelihood, then, the data are not derived from a homogeneous Poisson process.
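
The arithmetic behind the rate estimate can be checked with a few lines of Python. This is only a sketch: the helper name is hypothetical, and the figures plugged in are the ones quoted in the text.

```python
from datetime import timedelta


def rate_per_day(mean_interarrival):
    """Arrival rate per day implied by a mean interarrival time."""
    return timedelta(days=1) / mean_interarrival


# Mean interarrival time reported for the 580 training insertions.
homogeneous_rate = rate_per_day(timedelta(hours=5, minutes=15, seconds=19))

# The null hypothesis is rejected when the KS statistic exceeds the threshold.
rejected = 0.106 > 0.0718
```

Rounded to two decimals, `homogeneous_rate` reproduces the 4.57 arrivals per day stated above.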

Next, we have applied a compound homogeneous Poisson model. Our rationale in this case is
that DBWORLD is a moderated list, and the moderators sometimes work on postings in batches.
These batches are sometimes posted to the group in tightly-spaced clusters. For all practical
purposes, we treat each such cluster as a single batch insertion event. To construct the model,
any two insertions occurring within less than one minute of one another were considered to
be a single event occurring at the insertion time of the first arrival. For example, on November
14, 2000, we had three arrivals, one at 13:43:19, and two more at 13:43:23. All three arrivals are
considered to belong to the same insertion event, with an insertion time of 13:43:19. Using
the compound variation, the data set now has 557 insertion events. The revised average interarrival
time is now 5:28:20, and thus $\lambda = 4.39$ per day. Figure 3(b) provides a pictorial comparison of
the cumulative distribution function of the interarrival times, assuming a compound model, with
its theoretical counterpart. We have applied the Kolmogorov-Smirnov test to the distribution
of interarrival times, comparing it with an exponential distribution with a parameter of $\lambda = 4.39$.
The outcome was somewhat better than before: $D_n = 0.094$, which means we can still reject the
null hypothesis at any significance level $\alpha \ge 0.005$ (for $\alpha = 0.005$, the rejection threshold is 0.0733
for $n = 557$). Although the compound variant of the model fits the data better, it is still not
statistically plausible.

Figure 3: A comparison of the theoretical and empirical distribution functions for the homogeneous
Poisson process model (a) and the compound homogeneous Poisson process model (b).

    Interval         Workdays    Saturday    Sunday
    [0:00,3:00)        2.40        1.50       1.15
    [3:00,6:00)        5.96
    [6:00,9:00)        6.04
    [9:00,18:00)       7.50
    [18:00,21:00)      3.03
    [21:00,24:00)      2.41

Table 2: Average $\lambda$ levels for the recurrent piecewise-constant Poisson model. The Saturday and
Sunday values each apply to the whole day.
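
The batching rule used for the compound model can be sketched as follows. The function name is hypothetical, and one reading of the rule is assumed here: an arrival joins the current batch when it follows the previous arrival by less than one minute, and the batch keeps the time of its first arrival.

```python
from datetime import datetime, timedelta


def merge_batches(arrivals, window=timedelta(minutes=1)):
    """Group arrival times into (event_time, batch_size) pairs.

    Assumption: an arrival is merged into the current batch when it is
    less than `window` after the previous arrival.
    """
    events = []
    previous = None
    for t in sorted(arrivals):
        if previous is not None and t - previous < window:
            events[-1][1] += 1          # same batch, one more posting
        else:
            events.append([t, 1])       # new insertion event
        previous = t
    return [(t, size) for t, size in events]
```

Applied to the three arrivals of November 14, 2000 mentioned above, this yields a single event of size 3 stamped 13:43:19.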

4.2 Fitting the RPC Poisson process 

Next, we tried fitting the data to an RPC model. Examining the data, we chose a cycle of one
week. Within each week, we used the same pattern for each weekday, with one interval for work
hours (9:00-18:00), plus five additional three-hour intervals for "off hours". We treated Saturday
and Sunday each as one long interval. Table 2 shows the arrival rate parameters for each segment of
the RPC Poisson model, calculated in much the same manner as for the homogeneous Poisson
model.

Figure 4: A comparison of the theoretical and empirical distribution functions of U for the RPC
Poisson model (a) and the compound RPC Poisson model (b).



The specific methodology for structuring the RPC Poisson model is beyond the scope of this
paper and can range from ad hoc "look and feel" crafting (as practiced here) to more established 
formal processes for statistically segmenting, filtering, and aggregating intervals p^ , 33|. It is worth 
noting, however, that from experimenting with different methods, we have found that the model 
is not sensitive to slight changes in the interval definitions. Also, the model we selected has only 
8 segments, and thus only 8 parameters, so there is little danger of "overfitting" the training data 
set, which has over 500 observations. 

Next, we attempted to statistically validate the RPC model. To this end, we use the following
lemma:



Lemma 6 Given a nonhomogeneous Poisson process with arrival intensity $\lambda(t)$, the random
variable $U_s = \int_s^{s+L_{R,s}} \lambda(t)\,dt$ is of the distribution Exp(1).

Proof. Let $f_s(t) = \Lambda(s, s+t)$, which is a monotonically nondecreasing function. From an earlier
lemma, $P\{L_{R,s} < t\} = 1 - e^{-\Lambda(s,s+t)}$ for all $t > 0$. We have $U_s = f_s(L_{R,s})$. By applying the
monotonic function $f_s$ to both sides of the inequality $L_{R,s} < t$, one has that
$P\{f_s(L_{R,s}) < f_s(t)\} = P\{L_{R,s} < t\} = 1 - e^{-\Lambda(s,s+t)}$ for all $t > 0$. Substituting in the
definitions of $U_s$ and $u = f_s(t)$, one then obtains $P\{U_s < u\} = 1 - e^{-u}$ for all $u > 0$, and
therefore $U_s \sim \mathrm{Exp}(1)$. ■

Thus, given an instantaneous arrival rate $\lambda(t)$, and a sequence of observed arrival events
$\{t_n\}_{n=0}^{N}$, we compute the set of values $u_n = \int_{t_{n-1}}^{t_n} \lambda(t)\,dt$, $n = 1, \ldots, N$, and perform a
Kolmogorov-Smirnov test of them versus the unit exponential distribution.
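
To make the rescaling step concrete, the sketch below integrates a weekly piecewise-constant rate between consecutive arrival times. It is an illustration only: the function names are hypothetical, the rates are the non-compound values of Table 2, and we assume they are expressed in arrivals per day. Since every segment boundary falls on a whole hour, the integral is accumulated hour by hour.

```python
from datetime import datetime, timedelta

# (start hour, end hour, arrivals per day) for Monday through Friday.
WORKDAY_RATES = [
    (0, 3, 2.40), (3, 6, 5.96), (6, 9, 6.04),
    (9, 18, 7.50), (18, 21, 3.03), (21, 24, 2.41),
]
SATURDAY_RATE = 1.50
SUNDAY_RATE = 1.15


def rate_at(t):
    """Arrival rate (per day) in force at datetime t."""
    weekday = t.weekday()  # Monday is 0, Sunday is 6
    if weekday == 5:
        return SATURDAY_RATE
    if weekday == 6:
        return SUNDAY_RATE
    for start, end, rate in WORKDAY_RATES:
        if start <= t.hour < end:
            return rate
    raise ValueError("hour outside 0..23")


def integrated_rate(t0, t1):
    """Integral of the rate over [t0, t1), stepping to each whole hour."""
    total = 0.0
    t = t0
    while t < t1:
        next_hour = t.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        step_end = min(next_hour, t1)
        total += rate_at(t) * ((step_end - t) / timedelta(days=1))
        t = step_end
    return total


def rescaled_gaps(arrivals):
    """Values u_n for a sorted list of arrival datetimes."""
    return [integrated_rate(a, b) for a, b in zip(arrivals, arrivals[1:])]
```

The resulting values can then be compared with the unit exponential distribution using any Kolmogorov-Smirnov routine.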

Figure 4(a) provides a comparison of the theoretical and empirical cumulative distributions of
the random variable U. We applied the Kolmogorov-Smirnov test to U, comparing it with an
exponential distribution with $\lambda = 1$, based on Lemma 6. The outcome of the test is $D_n = 0.080$,
which is better than either homogeneous model, but is still rejected at any reasonable level of
significance (recall that for $\alpha = 0.005$, the rejection threshold is again 0.0718 for $n = 580$).

Finally, we evaluated a compound version of the RPC model, combining successive postings
separated by less than one minute. We kept the same segmentation as in Table 2, but recalculated
the arrival intensities in each segment, as shown in Table 3.

Next we recalculated the sample of the random variable U for the compound RPC Poisson
model, and applied the Kolmogorov-Smirnov test. In this case, we have $D_n = 0.050$, which cannot
be rejected at any significance level up to $\alpha = 0.10$ (for $\alpha = 0.10$, the rejection threshold
is 0.0517 for $n = 557$). Figure 4(b) shows the theoretical and empirical distributions of U in this
case.

Table 3: Average $\lambda$ levels for the compound RPC Poisson model.

    Model                    D_n      Rejection level
    Homogeneous              0.106    < 0.005
    Homogeneous+compound     0.094    < 0.005
    RPC                      0.080    < 0.005
    RPC+compound             0.050    > 0.100

Table 4: Goodness of fit of the four models.

As a final confirmation of the applicability of the compound RPC Poisson model, we attempted
to validate the assumption that the numbers of postings in successive insertion events are
independent and identically distributed (IID). In the sample, 536 insertion events were of size 1, 19
were of size 2, and 2 were of size 3. Thus, we approximate the random variable $\Delta^+$ as having a
536/557 ≈ 0.962 probability of being 1, a 19/557 ≈ 0.034 probability of being 2, and a 2/557 ≈ 0.004
probability of being 3. Validating that the observed insertion batch sizes $\Delta^+_i$ appear to be
independently drawn from this distribution is somewhat delicate, since they nearly always take the
value 1. To compensate, we performed our test on the runs in the sample, that is, the number of
consecutive insertion events of size 1 between insertions of size 2 or 3. Our sample contains 21 runs,
ranging from 0 to 112. If the insertion batch sizes $\{\Delta^+_i\}$ are independent with the distribution
of $\Delta^+$, then the length of a run should be a geometric random variable with parameter
536/557 ≈ 0.962. We tested this hypothesis via a Kolmogorov-Smirnov test, as shown in Figure 5.
The $D_n$ statistic is 0.207, which is within the $\alpha = 0.1$ acceptance level for a sample of size $n = 21$
(although the divergence of the theoretical and empirical curves in Figure 5 is more visually
pronounced than in the prior figures, it should be remembered that the sample is far smaller). Thus,
the assumption that the insertion batch sizes $\{\Delta^+_i\}$ are IID is plausible.
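
The run-length construction can be sketched in a few lines of Python. The function names are hypothetical, and one convention is assumed: a run is the number of size-1 events that precede each multi-posting event, so any trailing run after the last multi-posting event is not counted.

```python
def run_lengths(batch_sizes):
    """Number of size-1 events preceding each event of size 2 or more."""
    runs = []
    current = 0
    for size in batch_sizes:
        if size == 1:
            current += 1
        else:
            runs.append(current)
            current = 0
    return runs


def geometric_cdf(k, p_single):
    """P{run length <= k} when each event has size 1 with probability p_single."""
    return 1.0 - p_single ** (k + 1)
```

For the training data one would take `p_single = 536 / 557` and compare the empirical distribution of the 21 run lengths against `geometric_cdf`.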

Table 4 compares the goodness-of-fit of the four models to the test data. For each of the models,
we have specified the KS test result ($D_n$) and the level at which one can reject the null hypothesis.
The higher the level at which the null hypothesis can first be rejected, the better the fit. The
compound RPC Poisson model fits the data set best: the null hypothesis cannot be rejected at any
level up to 0.1, which in practice means that the model fits the data well. The main conclusion
from these experiments is that the simple model of a homogeneous Poisson process is limited to the
modeling of a restricted class of applications (one of which was suggested in [0]). Therefore, there
is a need for a more elaborate model, as suggested in this paper, to capture a broader range of
update behaviors. A nonhomogeneous model consisting of just 8 segments per week, as we have
constructed, seems to model the arrivals significantly better than the homogeneous approach.

Figure 5: Empirical and theoretical distributions for number of single arrivals between multiple
arrivals, compound RPC Poisson model.



5 Content evolution cost model 

We now develop a cost model suitable for transcription-scheduling applications such as those
described in an earlier example. The question is how often to generate a remote replica of a relation R.
We have suggested one such policy in an earlier example. In this section, we shall introduce two more
policies and show an empirical comparison based on the data introduced in Section 4.

A transcription policy aims to minimize the combined transcription cost and obsolescence
cost [11]. The former includes the cost of connecting to a network and the cost of transcribing the
data, and may depend on the time at which the transcription is performed (e.g., as a function
of network congestion), and the length of connection needed to perform the transcription. The
obsolescence cost captures the cost of using obsolescent data, and is basically a function of the
amount of time that has passed since the last transcription.

In what follows, let the set $\{b_i, e_i\}_{i=1}^{\infty}$ represent an infinite sequence of connectivity periods
between a client and a server. During session $i$, the client data is synchronized with the state of the
server at time $b_i$, the information becoming available at the client at time $e_i$. At the next session,
beginning at time $b_{i+1}$, the client is updated with all the information arriving at the server during
the interval $(b_i, b_{i+1}]$, which becomes usable at time $e_{i+1}$, and so forth. We define $b_0 = e_0 = 0$, and
require that $0 < b_1 < e_1 < b_2 < e_2 < \cdots$.

Let $C_{R,u}(s, f)$ denote the cost of performing a transcription of R starting at time $f$, given that
the last update was started at time $s$. Let $C_{R,o}(s, f)$, to be described in more detail later, denote
the obsolescence cost through time $f$ attributable to tuples inserted into R at the server during the
time interval $(s, f]$. Then the total cost $C_R(t)$ through time $t$ is

$$C_R(t) = \sum_{i:\, b_i \le t} \Big( \alpha\, C_{R,u}(b_{i-1}, b_i) + (1 - \alpha)\, C_{R,o}(b_{i-1}, b_i) \Big) + (1 - \alpha)\, C_{R,o}(b_{i^*(t)}, t), \qquad (10)$$

where $i^*(t) = \max\{i \mid b_i \le t\}$ and $\alpha$ serves as the ratio of importance a user puts on the
transcription cost versus the obsolescence cost. Traditionally, $\alpha = 0$, and therefore $C_R(t)$ is minimized
for $C_{R,o}(b_{i-1}, b_i) = 0$, $\forall b_i \le t$, allowing the use of current data only. In this section we shall look
into another, more realistic approach, where data currency is sacrificed (up to a level defined by
the user through $\alpha$) for the sake of reducing the transcription cost. Ideally, one would want to
choose the sequence $\{b_i, e_i\}_{i=1}^{\infty}$ of connectivity periods, subject to any constraints on their
durations $e_i - b_i$, to minimize $C_R(t)$ over some time horizon $t$. One may also consider the asymptotic
problem of minimizing the average cost over time, $\lim_{t \to \infty} C_R(t)/t$. We note that the presence of
$\alpha$ is not strictly required, as its effects could be subsumed into the definitions of the $C_{R,u}$ and $C_{R,o}$
functions, especially if both are expressed in natural monetary units. However, we retain $\alpha$ in order
to demonstrate some of the parametric properties of our model.
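
As an informal illustration of the total cost (10), the following Python sketch evaluates it for a given list of session start times. All names are hypothetical; the two cost functions are supplied by the caller, and we assume that a session counts toward the sum when it starts no later than the horizon.

```python
def total_cost(t, starts, alpha, transcription_cost, obsolescence_cost):
    """Weighted transcription plus obsolescence cost through time t.

    starts: increasing session start times b_1, b_2, ... (b_0 = 0 is implied).
    transcription_cost(s, f), obsolescence_cost(s, f): caller-supplied costs.
    """
    total = 0.0
    previous = 0.0
    for b in starts:
        if b > t:
            break
        total += alpha * transcription_cost(previous, b)
        total += (1.0 - alpha) * obsolescence_cost(previous, b)
        previous = b
    # Obsolescence accumulated since the last completed session.
    total += (1.0 - alpha) * obsolescence_cost(previous, t)
    return total
```

For instance, with a unit transcription cost, a quadratic obsolescence cost of `0.5 * (f - s) ** 2`, sessions at times 1, 2 and 3, a horizon of 3.5 and a weight of 0.5, the function returns 2.3125.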

In general, modeling transcription and obsolescence costs may be difficult and
application-dependent. They may be difficult to quantify and difficult to convert to a common set of
units, such as dollars or seconds. Some subjective estimation may be needed, especially for the
obsolescence costs. However, we maintain that, rather than avoiding the subject altogether, it is best
to try to construct these cost models and then use them, perhaps parametrically, to evaluate
transcription policies. Any transcription policy implicitly makes some trade-off between consuming
network resources and incurring obsolescence, so it is best to try to quantify the trade-off and see if
a better policy exists. In particular, one should try to avoid policies that are clearly dominated,
meaning that there is another policy with the same or lower transcription cost, and strictly lower
obsolescence, or vice versa. Below, for purposes of illustration, we will give one simple, plausible
way in which the cost functions may be constructed; alternatives are left to future research.

5.1 Transcription costing example 

In determining the transcription cost, one may use existing research into costs of distributed query 
execution strategies. Typically, {e.g., pl| ]) the transcription time can be computed as some function 
of the CPU and I/O time for writing the new tuples onto the client and the cost of transmitting 
the tuples over a network. There is also some fixed setup time to establish the connection, which 
can be substantial. For purposes of example, suppose that 

$$\begin{aligned}
C_{R,u}(s, f) &= c + \beta \cdot \big(X_R(s, f) + Y_R^+(s, f) + |R(s)| - Y_R(s, f)\big) \\
&= c + \beta \cdot \big(X_R(s, f) + |R(s)| - Y_R^-(s, f)\big).
\end{aligned}$$

Here, $c > 0$ denotes the fixed setup cost, $\beta > 0$, $X_R(s, f)$ denotes the number of tuples inserted
during the interval $(s, f]$ that survive through time $f$, $Y_R^+(s, f)$ is the number of tuples that survive
but are modified by time $f$, and $|R(s)| - Y_R(s, f)$ is the number of deleted tuples; the second line
uses $Y_R^-(s, f) = Y_R(s, f) - Y_R^+(s, f)$, the number of surviving tuples that were not modified. For the
deleted tuples, it may suffice to transmit only the primary key of each deleted tuple, incurring a unit
cost of less than $\beta$. For sake of simplicity, however, we use the same cost factor $\beta$ for deletion,
insertion, and modification. We note that, under this assumption,

$$\sum_{i:\, b_i \le t} C_{R,u}(b_{i-1}, b_i) = n(t)\,c + \beta\,|R(s)| + \beta \sum_{i:\, b_i \le t} \big(X_R(b_{i-1}, b_i) - Y_R^-(b_{i-1}, b_i)\big),$$

where $n(t)$ is the number of transcriptions in the interval $[0, t]$. For the special case that there are
no deletions or modifications, $\beta\,|R(s)| + \beta\,\big(X_R(s, f) - Y_R^-(s, f)\big) = \beta B(s, f)$ and

$$\sum_{i:\, b_i \le t} C_{R,u}(b_{i-1}, b_i) = n(t)\,c + \beta B(0, b_{i^*(t)}).$$

i:bi<t 

For large $t$, one would expect the $\beta B(0, b_{i^*(t)})$ term to be roughly comparable across most reasonable
policies, whereas the $n(t)c$ term may vary widely for any value of $t$. It is worth noting that $c$ and $\beta$
could be generalized to vary with time or other factors. For example, due to network congestion, 
certain times of day may have higher unit transcription costs than others. Also, transcribing via 
airline-seat telephone costs substantially more than connecting via a cellular phone. For simplicity, 
we have refrained from discussing such variations in the transcription cost. 
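
A small worked instance of the example transcription cost may help. The sketch below uses hypothetical names and made-up counts, and simply evaluates the setup cost plus a per-tuple cost for inserted, modified and deleted tuples.

```python
def transcription_cost(setup, per_tuple, inserted, modified, initial_size, survivors):
    """Setup cost plus per-tuple cost of insertions, modifications and deletions.

    initial_size - survivors is the number of deleted tuples.
    """
    deleted = initial_size - survivors
    return setup + per_tuple * (inserted + modified + deleted)


# Made-up numbers: 5 insertions, 3 modifications, 10 deletions out of 100 tuples.
example = transcription_cost(setup=10.0, per_tuple=2.0, inserted=5,
                             modified=3, initial_size=100, survivors=90)
```

Here `example` evaluates to 46.0. Counting the 87 surviving unmodified tuples instead gives the same value, 10 + 2 * (5 + 100 - 87).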

5.2 Obsolescence costing example 

We next turn our attention to the obsolescence cost, which is clearly a function of the update time 
of tuples and the time they were transcribed to the client. Intuitively, the shorter the time between 
the update of a tuple and its transcription to the client, the better off the client would be. As a basis 
for the obsolescence cost, we suggest a criterion that takes into account user preferences, as well 
as the content evolution parameters. For any relation R, times $s < f$, and tuple $r \in R(s) \cup R(f)$,
let $b(r)$ and $d(r)$ denote the time $r$ was inserted into and deleted from R, respectively. We let
$\iota_r(s, f)$ be some function denoting the contribution of tuple $r$ to the obsolescence cost over $(s, f]$;
we will give some more specific example forms of this function later. We then make the following
definition:

Definition 3 The total obsolescence cost of a relation R over the time interval $(s, f]$ (denoted
$C_{R,o}(s, f)$) is defined to be $C_{R,o}(s, f) = \sum_{r \in R(s) \cup R(f)} \iota_r(s, f)$. □

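Our principal concern is with the expected obsolescence cost, that is, the expected value of
$C_{R,o}(s, f)$,

$$E[C_{R,o}(s, f)] = E\Big[\sum_{r \in R(s) \cup R(f)} \iota_r(s, f)\Big].$$

To compute $E[C_{R,o}(s, f)]$, we note that

$$E[C_{R,o}(s, f)] = E\Big[\sum_{r \in R(s) \cap R(f)} \iota_r(s, f)\Big] + E\Big[\sum_{r \in R(s) \setminus R(f)} \iota_r(s, f)\Big] + E\Big[\sum_{r \in R(f) \setminus R(s)} \iota_r(s, f)\Big].$$

The three terms in the last expression represent potentially modified tuples, deleted tuples, and
inserted tuples, respectively. We denote these three terms by $\bar\iota^M_R(s, f)$, $\bar\iota^D_R(s, f)$, and
$\bar\iota^I_R(s, f)$, respectively, whence

$$E[C_{R,o}(s, f)] = \bar\iota^M_R(s, f) + \bar\iota^D_R(s, f) + \bar\iota^I_R(s, f).$$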



5.3 Obsolescence for insertions 

We will now consider a specific metric for computing the obsolescence stemming from insertions in
$(s, f]$, as follows:

$$\iota^I_r(s, f) = \begin{cases} g^I(s, f, b(r)), & s < b(r) \le f < d(r) \\ 0, & \text{otherwise,} \end{cases} \qquad (11)$$

where $g^I(s, f, t)$ is some application-dependent function representing the level of importance a
user assigns, over the interval $(s, f]$, to a tuple arriving at a time $t$. For example, in an e-mail
transcription application, a user may attach greater importance to messages arriving during official
work hours, and a lesser measure of importance to non-work hours (since no one expects her to be
available at those times). Thus, one might define

$$g^I(s, f, t) = \int_t^f a(\tau)\,d\tau, \quad \text{where } a(\tau) = \begin{cases} a_1, & \text{if } \tau \text{ is during work hours} \\ a_2, & \text{if } \tau \text{ is during off hours} \end{cases} \qquad (12)$$

and $a_1 > a_2$. For $a_1 = a_2 = 1$, $g^I(s, f, t)$ takes a form resembling the age of a local element in [Q].
More complex forms of $g^I(s, f, t)$ are certainly possible. In this simple case, we refer to $a_1/a_2$ as
the preference ratio.

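Using the properties of nonhomogeneous Poisson processes, we calculate

$$\begin{aligned}
\bar\iota^I_R(s, f) &= E\Big[\sum_{r \in R(f) \setminus R(s)} \iota_r(s, f)\Big] \\
&= E[X_R(s, f)] \cdot E[g^I(s, f, b(r)) \mid s < b(r) \le f < d(r)] \\
&= \Lambda_R(s, f)\, E[\Delta^+] \int_s^f \frac{\lambda_R(t')}{\Lambda_R(s, f)}\, g^I(s, f, t')\,dt' \\
&= E[\Delta^+] \int_s^f \lambda_R(t')\, g^I(s, f, t')\,dt'.
\end{aligned}$$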

Example 10 (Transcription policies using the expected obsolescence cost) Consider the
insertion-only data set of Section 4. Figure 6 compares two transcription policies for the week
[2001/4/2:0:00, 2001/4/8:0:00). The transcription policy in Figure 6(a) (referred to below as the
uniform synchronization point, or USP, policy) was suggested in prior work. According to this policy,
the intervals $(s, f]$ are always of the same size. The decision regarding the interval size $f - s$
may be either arbitrary (e.g., once a day) or may depend on $\lambda$, the Poisson model parameter (in
which case a homogeneous Poisson process is implicitly assumed). The policy may be expressed as
$f = s + M/\lambda$ for some multiplier $M > 0$. With $M = 1$ (as suggested in prior work) and $\lambda = 4.57$
per day, as computed from the training data, one would refresh the database every 5:15:19.
Figure 6(a) shows the transcription times resulting from the USP policy.

Consider now another transcription policy, dubbed the threshold policy. With this policy, given
that the last connection started at time $s$, we transcribe at time $f$ if the expected obsolescence cost
from insertions ($\bar\iota^I_R(s, f)$) exceeds $\Pi$, where $\Pi$ is a threshold that measures the user's tolerance to
obsolescent data. In comparing the two policies, one can compute $\Pi$, given $M$, as follows. Consider
the homogeneous case where $E[\Delta^+] = 1$ and $\lambda_R(t) = \lambda_R$ for all $t$. Assume further that
$a_1 = a_2 = 1$, so that $a(t) = 1$ for all $t$. In this case,

$$\bar\iota^I_R(s, f) = \int_s^f \lambda_R\,(f - t)\,dt = \lambda_R \int_s^f (f - t)\,dt = \tfrac{1}{2}\lambda_R\,(f - s)^2.$$

Setting $f = s + M/\lambda_R$ and $\Pi = \bar\iota^I_R(s, f)$, one has that

$$\Pi = (1/2)\,\lambda_R\,(f - s)^2 = (1/2)\,\lambda_R\,(M/\lambda_R)^2 = M^2/(2\lambda_R).$$

Figure 6(b) shows the transcription times using the RPC arrival model (see Section 4.2) and the
threshold policy with $\Pi = 0.109$ (obtained by setting $M = 1$ and $\lambda = 4.57$ per day, and letting
$\Pi = M^2/(2\lambda)$). It is worth noting that transcriptions are more frequent when the $\lambda$ intensity is
higher and less frequent whenever the arrival rate is expected to be more sluggish.

Figure 6: Transcription times for the USP and RPC/threshold policies.

Figure 7: Threshold policy, homogeneous vs. RPC.
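
The relation between the multiplier and the threshold, and the way a threshold policy picks its next transcription time, can be sketched as follows. The names are hypothetical, time is measured in days, and the search over a one-minute grid is a simplification chosen for this illustration; it returns the first grid point at which the expected cost reaches the threshold.

```python
def threshold_from_multiplier(m, rate):
    """Threshold matching a USP multiplier m in the homogeneous case."""
    return m ** 2 / (2.0 * rate)


def homogeneous_obsolescence(rate, s, f):
    """Expected insertion obsolescence for a constant rate and unit weights."""
    return 0.5 * rate * (f - s) ** 2


def next_transcription(s, threshold, expected_cost, step=1.0 / 1440):
    """First time on a one-minute grid where expected_cost(s, f) >= threshold."""
    f = s
    while expected_cost(s, f) < threshold:
        f += step
    return f
```

With `m = 1` and a rate of 4.57 per day, `threshold_from_multiplier` gives about 0.109, and under the homogeneous cost the next transcription falls roughly 1/4.57 of a day after the previous one. Passing a cost function built from a piecewise-constant rate instead would give the uneven spacing described above.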

We have performed experiments comparing the performance of the threshold policy for the
homogeneous Poisson model (equivalent to the USP policy) and the RPC Poisson model. Figure 7 shows
representative results, with costs computed over the testing set. Figure 7(a) displays the obsolescence
cost and the number of transcriptions for various $M$ values, with a preference ratio $a_1/a_2 = 4$. For
all $M$ values, there is no dominant model. For example, for $M = 1$, the RPC model has a slightly
higher obsolescence cost (43.02 versus 42.35, a 1.6% increase) and a significantly lower number of
transcriptions (137 versus 204, a 32.8% decrease).

Figure 7(b) provides a comparison of combined normalized obsolescence and transcription costs
for both insertion models and $\alpha \in \{0.6, 0.7, 0.8\}$ (still assuming a 4:1 preference ratio). Solid lines
represent results for the homogeneous Poisson model, while dotted lines represent results
for the RPC Poisson model. Generally speaking, the RPC model performs better for small
$M$ values ($M < 7$), while the homogeneous model performs better for the largest $M$ values ($M > 8$).
□

Figure 8: Transcription schedule based on the first alteration policy.



Example 11 (Comparison of USP, threshold and FA policies) Once again with the data
from Section 4, we consider one more transcription policy, the first alteration (FA) policy derived
from the analysis of an earlier example. Since there are no deletions or modifications, $Z_R(s, f)$
simplifies to $\Lambda_R(s, f)$. We choose $\pi$ in the FA policy to be a function of $M$ such that the transcription
intervals agree with the USP policy in the case of the homogeneous model. Figure 9 compares the
performance of all three transcription policies: USP, threshold, and FA, for a 4:1 preference ratio
and $\alpha = 0.8$, using the testing data set to compute the costs. For $M = 1$, the threshold policy and
the FA policy perform similarly, with the FA policy performing slightly better than the threshold
policy. Both policies outperform the USP policy. The threshold policy is best for $M \in \{2, \ldots, 8\}$. For
all $M > 8$, the USP policy is best. The best policy for this choice of $a_1/a_2$ and $\alpha$ is threshold with
$M = 6$, followed closely by FA with $M \in \{5, 6\}$. We have conducted our experiments with various
$\alpha$ values and our conclusion is that the threshold model is preferred over the USP model for larger
$\alpha$, that is, the more the user is willing to sacrifice currency for the sake of reducing transcription
cost. □



5.4 Obsolescence for deletions 



In a similar manner to Section 5.3, we will consider the following metric for computing the
obsolescence stemming from deletions in $(s, f]$. We compute $\iota^D_r(s, f)$ via

$$\iota^D_r(s, f) = \begin{cases} g^D(s, f, d(r)), & b(r) \le s < d(r) \le f \\ 0, & \text{otherwise,} \end{cases} \qquad (13)$$

where $g^D(s, f, t)$ is some application-dependent function, possibly similar to $g^I(s, f, t)$ above.

Figure 9: Comparison of three policies, for a 4:1 preference ratio and $\alpha$ = 0.5.

Using the properties of nonhomogeneous Poisson processes, we calculate

$$\begin{aligned}
\bar\iota^D_R(s, f) &= E\Big[\sum_{r \in R(s) \setminus R(f)} \iota_r(s, f)\Big] \\
&= \big(|R(s)| - E[Y_R(s, f)]\big)\, E[g^D(s, f, d(r)) \mid b(r) \le s < d(r) \le f] \\
&= \big(|R(s)| - p_R(s, f)\,|R(s)|\big)\, E[g^D(s, f, d(r)) \mid b(r) \le s < d(r) \le f] \\
&= |R(s)|\,\big(1 - p_R(s, f)\big)\, E[g^D(s, f, d(r)) \mid b(r) \le s < d(r) \le f].
\end{aligned}$$

In the case {R,S) has fixed multiplicity for all 5 G S{R), PR{s,f) = exp(— M/j(s, /)), where 
Mr{sJ) = jf ilR{t)dt and /i/j(i) =Y^seS{R)'^iR^^)f^sit)- Therefore, 



rD 



f^AsJ) = \R{s)\(l-exp{-MR{s,f)) 



f 



g''{sj,t')dt' 



\R{s) 



l-exp(-MH(g,/)) 
Mr{sJ) 



MR{s,f) 

m{t')-g^{sj,t')dt' 



5.5 Obsolescence for modification 

We now consider obsolescence costs relating to modifications. While, in some applications, a user
may be primarily concerned with how many tuples were modified during $[s, f)$, we believe that a
more general, attribute-based framework is warranted here, taking into account exactly how each
tuple was changed. Therefore, we define $\iota_{r,A}(s, f)$ to be some function denoting the contribution
of attribute $A \in \mathcal{A}(R)$ in tuple $r$ to the obsolescence cost over $(s, f]$ and assume that

$$\iota_r(s, f) = \sum_{A \in \mathcal{A}(R)} \iota_{r,A}(s, f).$$

Therefore,

$$\bar\iota^M_R(s, f) = E\Big[\sum_{r \in R(s) \cap R(f)} \sum_{A \in \mathcal{A}(R)} \iota_{r,A}(s, f)\Big] = \sum_{A \in \mathcal{A}(R)} E\Big[\sum_{r \in R(s) \cap R(f)} \iota_{r,A}(s, f)\Big] = \sum_{A \in \mathcal{A}(R)} \bar\iota^M_{R,A}(s, f),$$

where $\bar\iota^M_{R,A}(s, f)$ is the expected obsolescence cost due to modifications to $A$ during $(s, f]$. Assuming
that attributes not in $\mathcal{C}(R)$ incur zero modification cost, the last sum may be taken over $\mathcal{C}(R)$ instead
of $\mathcal{A}(R)$.

We start the section by introducing the notion of a distance metric and provide two models of
$\iota_{r,A}(s, f)$, for numeric and non-numeric domains. We then provide an explicit description of
$\bar\iota^M_{R,A}$, based on distance metrics.



5.5.1 General distance metrics 

Let $c^{R,A}_{u,v}$, where $u, v \in \mathrm{dom}\,A$, denote the elements of a matrix of costs for an attribute $A$. We
declare that if $r.A(s) = u$ and $r.A(f) = v$, then $\iota_{r,A}(s, f) = c^{R,A}_{u,v}$, or equivalently,

$$\iota_{r,A}(s, f) = c^{R,A}_{r.A(s),\,r.A(f)}.$$

Consequently, we require that $c^{R,A}_{u,u} = 0$ for all $u \in \mathrm{dom}\,A$, so that an unchanged attribute field
yields a cost of zero.

A squared-error metric for numeric domains: For numeric domains, that is, $A \in N$, we
propose a squared-error metric, as is standard in statistical regression models. In this case, we let

$$\iota_{r,A}(s, f) = c^{R,A}_{r.A(s),\,r.A(f)} = k_{R,A}(s)\,\big(r.A(f) - r.A(s)\big)^2,$$

where $k_{R,A}(s)$ is a user-specified scaling factor. A typical choice for the scaling factor would be the
reciprocal $1/\mathrm{Var}_{r \in R(s)}[r.A(s)]$ of the variance of attribute $A$ in $R$ at time $s$,

$$\begin{aligned}
\mathrm{Var}_{r \in R(s)}[r.A(s)] &= E_{r \in R(s)}\Big[\big(r.A(s) - E_{r \in R(s)}[r.A(s)]\big)^2\Big] \\
&= E_{r \in R(s)}\big[r.A(s)^2\big] - E_{r \in R(s)}[r.A(s)]^2 \\
&= \frac{1}{|R(s)|} \sum_{r \in R(s)} r.A(s)^2 - \Big(\frac{1}{|R(s)|} \sum_{r \in R(s)} r.A(s)\Big)^2.
\end{aligned}$$

Other choices for the scaling factor $k_{R,A}(s)$ are also possible. In any case, we may calculate the
expected alteration cost for attribute $A$ in tuple $r$ via

$$\begin{aligned}
E[\iota_{r,A}(s, f)] &= E\big[k_{R,A}(s)\,(r.A(f) - r.A(s))^2\big] \\
&= k_{R,A}(s)\, E\big[r.A(f)^2 - 2\,r.A(f)\,r.A(s) + r.A(s)^2\big] \\
&= k_{R,A}(s)\,\Big(E\big[r.A(f)^2\big] - 2\,r.A(s)\,E[r.A(f)] + r.A(s)^2\Big). \qquad (14)
\end{aligned}$$
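
The squared-error metric and its variance-based scaling can be illustrated with a short Python sketch. The function names and the sample values are made up for this illustration; the population variance is used because the variance above is taken over all tuples in the relation at time s.

```python
from statistics import pvariance


def variance_scale(values_at_s):
    """Reciprocal of the population variance of the attribute at time s."""
    return 1.0 / pvariance(values_at_s)


def squared_error_cost(old_value, new_value, scale):
    """Scaled squared difference between the old and new attribute values."""
    return scale * (new_value - old_value) ** 2


# Made-up attribute values for eight tuples at time s.
scale = variance_scale([2, 4, 4, 4, 5, 5, 7, 9])
cost = squared_error_cost(4, 6, scale)
```

The sample has variance 4, so the scale is 0.25 and a change from 4 to 6 costs 1.0, that is, one variance unit. An unchanged value costs zero, as the general metric requires.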



A general metric for non-numeric domains: For non-numeric domains, it may not be possible
or meaningful to compute the difference of $r.A(s)$ and $r.A(f)$. In such cases, we shall use a general
cost matrix $[c^{R,A}_{u,v}]_{u,v \in \mathrm{dom}\,A}$ and compute

$$E[\iota_{r,A}(s, f)] = \sum_{v \in \mathrm{dom}\,A} P^{R,A}_{r.A(s),v}(s, f)\; c^{R,A}_{r.A(s),v} = \sum_{\substack{v \in \mathrm{dom}\,A \\ v \ne r.A(s)}} P^{R,A}_{r.A(s),v}(s, f)\; c^{R,A}_{r.A(s),v}.$$

For domains that have no particular structure, a typical choice might be $c^{R,A}_{u,v} = 1$ whenever
$u \ne v$. In this case, the expected cost calculation simplifies to

$$E[\iota_{r,A}(s, f)] = P\{r.A(f) \ne r.A(s)\} = 1 - P^{R,A}_{r.A(s),\,r.A(s)}(s, f).$$

We are now ready to consider the calculation of $\bar\iota^M_{R,A}(s, f)$.

5.5.2 The expected modification cost 

We next consider computing the expected modification cost $\bar\iota^M_{R,A}(s, f)$. To do so, we partition the
tuples $r$ in $R(s) \cap R(f)$ according to their initial value $r.A(s)$ of the attribute $A$. Consider the subset
$R_{A,u}(s) \cap R(f)$ of all $r \in R(s) \cap R(f)$ that have $r.A(s) = u$. Since all such tuples are indistinguishable
from the point of view of the modification process for $R.A$, their $\iota_{r,A}(s, f)$ random variables will
be identically distributed. The number of tuples $r \in R(s)$ with $r.A(s) = u$ is, by definition, $|R_{A,u}(s)|$.
The number $|R_{A,u}(s) \cap R(f)|$ that are also in $R(f)$ is a random variable whose expectation, by the
independence of the deletion and modification processes, must be $p_R(s, f)\,|R_{A,u}(s)|$. Using standard
results for sums of random numbers of IID random variables, we conclude that

$$\begin{aligned}
\bar\iota^M_{R,A}(s, f) &= E\Big[\sum_{r \in R(s) \cap R(f)} \iota_{r,A}(s, f)\Big] \\
&= \sum_{\substack{u \in \mathrm{dom}\,A \\ |R_{A,u}(s)| > 0}} \big(p_R(s, f)\,|R_{A,u}(s)|\big)\, E[\iota_{r,A}(s, f) \mid r.A(s) = u] \\
&= p_R(s, f) \sum_{\substack{u \in \mathrm{dom}\,A \\ |R_{A,u}(s)| > 0}} |R_{A,u}(s)|\; \bar\iota_{R,A,u}(s, f),
\end{aligned}$$

where we define $\bar\iota_{R,A,u}(s, f) = E[\iota_{r,A}(s, f) \mid r.A(s) = u]$. We now address the calculation of the
$\bar\iota_{R,A,u}(s, f)$.



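For a non-numeric domain, we have from Section 5.5.1 that

$$\bar\iota_{R,A,u}(s, f) = \sum_{\substack{v \in \mathrm{dom}\,A \\ v \ne u}} P^{R,A}_{u,v}(s, f)\; c^{R,A}_{u,v},$$

and in the simple case of $c^{R,A}_{u,v} = 1$ whenever $u \ne v$,

$$\bar\iota_{R,A,u}(s, f) = 1 - P^{R,A}_{u,u}(s, f).$$

In any case, $P^{R,A}_{u,v}(s, f)$ and $P^{R,A}_{u,u}(s, f)$ may be computed using the results of an earlier section.
For a numeric domain, we have from (14) that

$$\begin{aligned}
\bar\iota_{R,A,u}(s, f) &= k_{R,A}(s)\,\Big(E\big[(r.A(f))^2 \mid r.A(s) = u\big] - 2u\,E[r.A(f) \mid r.A(s) = u] + u^2\Big) \\
&= k_{R,A}(s)\,\Big(\Big(\sum_{v \in \mathrm{dom}\,A} (v^2 - 2uv)\,P^{R,A}_{u,v}(s, f)\Big) + u^2\Big).
\end{aligned}$$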

In cases where a random walk approximation applies, however, the situation simplifies consid- 
erably, as demonstrated in the following proposition. 

Proposition 7 When a random walk model with mean 5 and variance a"^ accurately describes 
modifications to a numeric attribute A, ^^Aui^if) ~ fei?,A(s) r_R,yl(s, /) {cr'^ + 2r/j^^(s, /)5^) . 

Proof. In this case, we note that the random variable r.A{f) — r.A(s) is identical to AA{s, f) 



(using the notation of section 3.2.2), and is independent of r.A{s). The number A'^ of modification 
events in (s, f] has a Poisson distribution with mean r^^^(s,/), and hence variance rR^^(s,/)^. 
Therefore we have, for any u G dom^, 



L 



R,AA^J)^knAs)^ (A^(s,/)) 



\2 



kR,A{s) [YaT[AA{s, /)] + E[AA{s, /)] 
kR,Ai.^) (e[^] (^^ + S^ Var[iV] + E[Nf6^ 
kRA^) ri?„A(s, /) {a^ + 2 TrAs, f)S^) . 



5.6 Example: the use of the cost model in Web crawling 

The following example concludes the introduction of the cost function. We show how, by using the 
cost model, one can generate an optimal transcription policy for Web crawling. 



Example 12 (Web Monitoring) WebSQL \23^] is a Web monitoring tool which uses a virtual 
database schema to query the structural properties of Web documents. The database schema consists 
of two relations, Document with six attributes, namely url, title, text, type, length, and 
modif, and Anchor with four attributes, namely base, label, href, and context. Each tuple 
in Anchor indicates that document base contains a link to document href. Consider the follow- 
ing query (taken from http://www. cs. toronto. edu/~websql/), which identifies locally reachable 
documents that contain some hyperlink to a compressed Postscript File: 

SELECT d.url, d. modif 



FROM Document d SUCH THAT ' Jittp: //www. OtherDoc. html] " ->->* d. 

Anchor a SUCH THAT base = d 

WHERE fi I ename (a . href) CONTAINS " .ps.Z" ; 
(We refrain from dwelling on the language specification; the interested reader is referred to the
cited Web site.) Assume that the cost of performing the query at time $t$ is $\sum_{d \in D(t)} \psi_d$, where
$D(t)$ represents the set of scanned documents and $\psi_d$ is a random variable representing the size of
document $d$ in bytes. Assuming the $\{\psi_d\}$ are IID, the expected cost of performing the query at time
$t$ is thus

$$\mathrm{E}\Bigl[\sum_{d \in D(t)} \psi_d\Bigr] = \mathrm{E}[|D(t)|]\,\mathrm{E}[\psi],$$

where $\psi$ is a generic random variable distributed like the $\{\psi_d\}$.
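
The expected-cost identity above can be illustrated by simulation. This is a minimal sketch, not part of the original example: it assumes, beyond the IID assumption, that the number of scanned documents is drawn independently of the document sizes, and both distributions (a uniform document count and exponential sizes) are hypothetical choices made only for illustration.

```python
import random

def simulate_expected_query_cost(trials, seed=0):
    """Compare a Monte Carlo estimate of E[sum of sizes] with E[|D(t)|] * E[psi]."""
    rng = random.Random(seed)
    max_docs = 20        # hypothetical: |D(t)| uniform on {0, ..., 20}, so E[|D(t)|] = 10
    mean_size = 1000.0   # hypothetical: sizes exponential with mean 1000 bytes
    total = 0.0
    for _ in range(trials):
        n_docs = rng.randint(0, max_docs)  # drawn independently of the sizes
        total += sum(rng.expovariate(1.0 / mean_size) for _ in range(n_docs))
    simulated = total / trials
    predicted = (max_docs / 2) * mean_size
    return simulated, predicted

simulated, predicted = simulate_expected_query_cost(trials=20000)
print(f"simulated {simulated:.1f} bytes, E[|D(t)|] E[psi] = {predicted:.1f} bytes")
```

With enough trials the simulated average should approach the product of the two expectations.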

A modification to a document is identified using changes to the modif attribute of the Document 
relation. For brevity in what follows, we let R = Document and A = modif. We assign the following 
costs to changes in A: 

• $c^{d}_{R}(s,f,t) = 0$ for all $s < t < f$, that is, the user has no interest in being notified of deleted
documents.

• For all $s < t < f$ and $u, v \in \operatorname{dom} A$, $u \neq v$, $c^{R,A}_{u,v}(s,f,t) = \mathrm{E}[\psi]$, where $c^{R,A}_{u,v}$ is the cost
for a modified document. For every other attribute $A' \neq A$, $c^{R,A'}_{u,v} = 0$ for all $u, v \in \operatorname{dom} A'$.

Suppose that a query was performed at time $s$, scanning the set of documents $D(s)$ and returning
the set of documents $B(s)$, where $|B(s)| < |D(s)|$. A user is interested in refreshing the query result
without overloading system resources, thus balancing the cost of refreshing the query results against
the cost of using partial or obsolescent data. This trade-off can be captured by the following policy:
refresh the query at time $f$, after performing it at time $s$, iff

$$\mathrm{E}\Bigl[\sum_{d \in D(f)} \psi_d\Bigr] < \mathrm{E}[C_{R,o}(s,f)].$$


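Thus, an equivalent condition is

$$\mathrm{E}[|D(f)|]\,\mathrm{E}[\psi] < \sum_{A' \in \mathcal{A}(R)} \iota^{m}_{R,A'}(s,f) + \iota^{d}_{R}(s,f) + \iota^{i}_{R}(s,f) = \iota^{m}_{R,A}(s,f) + \iota^{i}_{R}(s,f),$$

or

$$\bigl(p_R(s,f)\,|D(s)| + \Lambda_R(s,f)\,\mathrm{E}[\Delta^{+}]\bigr)\,\mathrm{E}[\psi] < \Bigl(p_R(s,f) \sum_{u \in \operatorname{dom} A} R_{A,u}(s)\,\bigl(1 - P^{R,A}_{u,u}(s,f)\bigr) + \Lambda_R(s,f)\,\mathrm{E}[\Delta^{+}]\,p\Bigr)\,\mathrm{E}[\psi],$$

where $p$ is the probability of a newly inserted document being relevant to the query. Cancelling the
factor of $\mathrm{E}[\psi]$, another equivalent condition is

$$p_R(s,f)\,|D(s)| + \Lambda_R(s,f)\,\mathrm{E}[\Delta^{+}] < p_R(s,f) \sum_{u \in \operatorname{dom} A} R_{A,u}(s)\,\bigl(1 - P^{R,A}_{u,u}(s,f)\bigr) + \Lambda_R(s,f)\,\mathrm{E}[\Delta^{+}]\,p,$$
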



which is independent of the expected document size. Further assume that $P^{R,A}_{u,u}(s,f) = P^{R,A}_{*,*}(s,f)$
is independent of $u$. Then the refresh condition can be expressed as

$$p_R(s,f)\,|D(s)| + \Lambda_R(s,f)\,\mathrm{E}[\Delta^{+}] < p_R(s,f)\,|B(s)|\,\bigl(1 - P^{R,A}_{*,*}(s,f)\bigr) + \Lambda_R(s,f)\,\mathrm{E}[\Delta^{+}]\,p.$$

6 Conclusion and topics for future research

This paper represents a first step in a new research area, the stochastic estimation of the consistency 
of transcribed data over time. We have also suggested one possible technique for assigning a cost 
to the differences between two relation extensions, including a means of computing the expected 
value of this cost under our stochastic model. We have discussed a number of potential applications 
relating to managing replicas, query management, and Web crawling. We have also examined several
strategies for refreshing replicas, although other strategies are certainly possible. 

As an illustration of the low client-side computational demands of the insertion-only transcription
application of our model, a Java-based demo, based on the transcription policies described in
[?] and in this paper, can be accessed at http://rbs.rutgers.edu:6677/. The demo compares
the performance of various policies using data stored in a backend mSQL database.

We hope to extend our work to the case where the materialized views are not simple replications, 
but are produced by SQL queries that involve selections, projections, natural joins, and certain 
types of aggregations. This work will involve a propagation algebra for tracing the base data 
changes through a series of relational operators. 

This development should make it possible to apply the theory to the management of more complex
queries than presented here. In particular, it will facilitate a possible approach to managing
general materialized view obsolescence on a query-by-query basis, taking into account current user
preferences for query accuracy and speed. The refresh rate of materialized views in a periodically
updated data source (such as a data warehouse) can be defined in terms of data obsolescence,
which in turn can be stochastically estimated using our model for content evolution. In this case,
we advocate a three-way cost model for query optimization [11], in which the query optimizer
evaluates various query plans using three complementary factors, namely generation cost, transmission
cost, and obsolescence cost. The first two factors take on a conventional interpretation, and the
obsolescence cost of a query represents a penalty for basing the query result on possibly obsolescent
materialized views. A query plan using only selection from a local materialized view, for example,
might have lower generation and transmission costs, but a higher obsolescence cost, than a plan
fetching complete base relations from an extranet and then processing them through a series of join
operations. Our model, when combined with additional techniques to propagate updates through
relational operators, can be used as a basis for estimating the obsolescence cost. However,
developing the propagation algebra may require some enrichment of our basic model, in particular the
introduction of dependency between the deletion and modification processes.

We foresee several additional future research directions. One direction involves the design of
efficient algorithms for the numerical computations required by our model. As it stands so far, the
most demanding computations required are general numerical integration and the matrix
exponentiation formula. With regard to integration, we note that, in practice, the nonhomogeneous
Poisson arrival rate functions $\lambda_R(\cdot)$, $\mu_R(\cdot)$, and $\gamma_{R,A}(\cdot)$ will most likely be chosen to be periodic
piecewise low-order polynomials, as suggested earlier in the paper. In such cases, many of the integrals
needed by the model could be performed in closed form within each time period.
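
To illustrate the closed-form integration mentioned above, the following is a minimal sketch of an integrated rate such as $\Lambda_R(s,f) = \int_s^f \lambda_R(t)\,dt$ for a periodic piecewise-polynomial rate function. The class name, its interface, and the daily rate profile are hypothetical and serve only as an example; they are not taken from the paper or its demo.

```python
from bisect import bisect_right

class PeriodicPiecewisePolyRate:
    """Periodic rate: on piece [b[k], b[k+1]) of one period,
    rate(t) = sum_j coeffs[k][j] * (t - b[k])**j."""

    def __init__(self, breakpoints, coeffs):
        # breakpoints: [0, b1, ..., period]; coeffs[k]: polynomial coefficients of piece k
        self.b = list(breakpoints)
        self.coeffs = [list(c) for c in coeffs]
        self.period = self.b[-1]
        # prefix[k] = closed-form integral of the rate over [0, b[k]]
        self.prefix = [0.0]
        for k in range(len(self.coeffs)):
            width = self.b[k + 1] - self.b[k]
            self.prefix.append(self.prefix[-1] + self._piece_integral(k, width))

    def _piece_integral(self, k, x):
        # Antiderivative of piece k, evaluated from local time 0 to x.
        return sum(c * x ** (j + 1) / (j + 1) for j, c in enumerate(self.coeffs[k]))

    def cumulative(self, t):
        """Integral of the rate over [0, t], for t >= 0."""
        full, rem = divmod(t, self.period)
        k = min(bisect_right(self.b, rem) - 1, len(self.coeffs) - 1)
        return full * self.prefix[-1] + self.prefix[k] + self._piece_integral(k, rem - self.b[k])

    def integrated_rate(self, s, f):
        """Integral of the rate over (s, f]."""
        return self.cumulative(f) - self.cumulative(s)

# Hypothetical daily profile (time in hours): constant 2 until 09:00,
# rising linearly to 12 by 19:00, then falling linearly back to 2 by 24:00.
rate = PeriodicPiecewisePolyRate([0, 9, 19, 24], [[2.0], [2.0, 1.0], [12.0, -2.0]])
print(rate.integrated_rate(9, 19))   # expected arrivals between 09:00 and 19:00
print(rate.integrated_rate(6, 30))   # a window spanning exactly one full period
```

Because each piece is integrated analytically and whole periods are handled by a single multiplication, the cost of evaluating the integral does not grow with the length of the interval.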

Further calibration and verification of the models in real situations are also needed. So far,
we have demonstrated that the insertion model has plausible applications, but this work needs 
to be extended to the deletion and modification models. Furthermore, the insertion model may 
need to be generalized to handle situations where there is "burstiness" or autocorrelation in the 
interarrival times that may require more involved techniques than simply combining very closely 
spaced arrivals. 

Another future research direction involves applying the model to real-life settings such as managing
a data warehouse. While the model is quite flexible, a methodology is still needed for structuring
Markov chains and estimating the stochastic model's parameters. Finally, in order to calibrate the
cost model, the issue of measuring user tolerance for data obsolescence should be considered.